Abstract



In the framework of Markov Decision Processes, we consider off-policy learning, that is,
the problem of learning a linear approximation of the value function of some fixed policy
from one trajectory possibly generated by some other policy. We briefly review on-policy
learning algorithms of the literature (gradient-based and least-squares-based), adopting a
unified algorithmic view. Then, we highlight a systematic approach for adapting them to
off-policy learning with eligibility traces. This leads to some known algorithms (off-policy
LSTD(λ), LSPE(λ), TD(λ), TDC/GQ(λ)) and suggests new extensions (off-policy FPKF(λ),
BRM(λ), gBRM(λ), GTD2(λ)). We describe a comprehensive algorithmic derivation of all
algorithms in a recursive and memory-efficient form, discuss their known convergence
properties and illustrate their relative empirical behavior on Garnet problems. Our
experiments suggest that the most standard algorithms, on- and off-policy LSTD(λ)/LSPE(λ)
(and TD(λ) if the feature space dimension is too large for a least-squares approach),
perform the best.

Keywords: Reinforcement Learning, Value Function Estimation, Off-policy Learning, 
Eligibility Traces 

1. Introduction 

We consider the problem of learning a linear approximation of the value function of some
fixed policy in a Markov Decision Process (MDP) framework, in the most general situation
where learning must be done from a single trajectory possibly generated by some other
policy, also known as off-policy learning. Given samples, well-known methods for estimating
a value function are temporal difference (TD) learning and Monte Carlo (Sutton and Barto,
1998). TD learning with eligibility traces (Sutton and Barto, 1998), known as TD(λ),
constitutes a nice bridge between both approaches, and by controlling the bias/variance
trade-off (Kearns and Singh, 2000), their use can significantly speed up learning. When the
value function is approximated through a linear architecture, the depth λ of the eligibility
traces is also known to control the quality of approximation (Tsitsiklis and Van Roy, 1997).
Overall, the use of these traces often plays an important practical role.

There has been a significant amount of research on parametric linear approximation of 
the value function, without eligibility traces (in the on- or off-policy case). We follow the 
taxonomy proposed by Geist and Pietquin (2013), briefly recalled in Table 1 and further

                         gradient-based                    least-squares-based
bootstrapping            TD (Sutton and Barto, 1998)       FPKF (Choi and Van Roy, 2006)
residual                 gBRM (Baird, 1995)                BRM (Engel, 2005; Geist and Pietquin, 2010b)
projected fixed point    TDC/GTD2 (Sutton et al., 2009)    LSTD (Bradtke and Barto, 1996)
                                                           LSPE (Nedic and Bertsekas, 2003)

Table 1: Taxonomy of linearly parameterized estimators for value function approximation
(Geist and Pietquin, 2013).



developed in Section 2. Value function approximators can be categorized depending on the
cost function they minimize (based on bootstrapping, on a Bellman residual minimization
or on a projected fixed point approach) and on how it is minimized (gradient-descent-based
or linear least-squares). Most of these algorithms have been extended to take into
account eligibility traces, in the on-policy case. Works on extending these approaches
(based on eligibility traces) to off-policy learning are scarcer. They are summarized in
Table 2 (algorithms in black). The first motivation of this article is to argue that it is
conceptually simple to extend all the algorithms of Table 1 so that they can be applied to
the off-policy setting and use eligibility traces. While this allows rederiving existing
algorithms (in black in Table 2), it also leads to new candidate algorithms (in red in
Table 2). The second motivation of this work is to discuss the subtle differences between
these intimately related algorithms, and to provide some comparative insights on their
empirical behavior (a topic that has, to our knowledge, not been considered in the
literature, even in the simplest on-policy and no-trace situation).



                         gradient-based                          least-squares-based
bootstrapping            off-policy TD(λ)                        off-policy FPKF(λ)
                         (Bertsekas and Yu, 2009a)
residual                 off-policy gBRM(λ)                      off-policy BRM(λ)
projected fixed point    GQ(λ) (a.k.a. off-policy TDC(λ))        off-policy LSTD(λ)
                         (Maei and Sutton, 2010)                 off-policy LSPE(λ)
                         off-policy GTD2(λ)                      (Yu, 2010a)

Table 2: Surveyed off-policy and eligibility-traces-based approaches. Algorithms in black
have been published before (provided references), algorithms in red are new.



The rest of this article is organized as follows. Section 2 introduces the background of
Markov Decision Processes, describes the state-of-the-art algorithms for learning without
eligibility traces, and gives the fundamental idea to extend the methods to the off-policy
situation with eligibility traces. Section 3 details this extension for the least-squares-based
approaches: the resulting algorithms are formalized, we derive recursive and memory-efficient
formulas for their implementation (this allows online learning without loss of generality,
all the more so as half of these algorithms are recursive by their very definition), and
we discuss their convergence properties. Section 4 does the same for stochastic gradient-based
approaches, which offer a smaller computational cost (linear per update, instead of
quadratic). Finally, Section 5 describes an empirical comparison and Section 6
concludes.

2. Background 

We consider a Markov Decision Process (MDP), that is a tuple $\{S, A, P, R, \gamma\}$ in which $S$
is a finite state space identified with $\{1, 2, \dots, N\}$, $A$ a finite action space,
$P \in \mathcal{P}(S)^{S \times A}$ the set of transition probabilities, $R \in \mathbb{R}^{S \times A}$ the reward
function and $\gamma$ the discount factor. A mapping $\pi \in \mathcal{P}(A)^S$ is called a policy. For any
policy $\pi$, let $P^\pi$ be the corresponding stochastic transition matrix, and $R^\pi$ the vector of
mean reward when following $\pi$, i.e., of components $E_{a|\pi,s}[R(s,a)]$. The value $V^\pi(s)$ of
state $s$ for a policy $\pi$ is the expected discounted cumulative reward starting in state $s$
and then following the policy $\pi$:

$$V^\pi(s) = E_\pi\left[\sum_{i=0}^{\infty} \gamma^i r_i \,\Big|\, s_0 = s\right],$$

where $E_\pi$ denotes the expectation induced by policy $\pi$. The value function satisfies the
(linear) Bellman equation:

$$\forall s, \quad V^\pi(s) = E_{s',a|s,\pi}\left[R(s,a) + \gamma V^\pi(s')\right].$$

It can be rewritten as the fixed-point of the Bellman evaluation operator: $V^\pi = T^\pi V^\pi$,
where for all $V$, $T^\pi V = R^\pi + \gamma P^\pi V$.
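As an illustration of the fixed-point characterization above, the following minimal sketch iterates the Bellman evaluation operator on a small chain. The two-state reward vector and transition matrix are hypothetical values chosen for the example, and the function names are ours; the sketch only assumes that the policy-averaged quantities $R^\pi$ and $P^\pi$ are given.

```python
def bellman_operator(V, R, P, gamma):
    """Apply T V = R + gamma * P V, with R and P already averaged over the policy."""
    n = len(V)
    return [R[s] + gamma * sum(P[s][t] * V[t] for t in range(n)) for s in range(n)]


def evaluate_policy(R, P, gamma, tol=1e-12, max_iter=100_000):
    """Iterate V <- T V until two successive iterates differ by less than tol."""
    V = [0.0] * len(R)
    for _ in range(max_iter):
        V_next = bellman_operator(V, R, P, gamma)
        if max(abs(a - b) for a, b in zip(V, V_next)) < tol:
            return V_next
        V = V_next
    return V


# Hypothetical two-state chain (illustrative numbers, not taken from the article).
R_pi = [1.0, 0.0]
P_pi = [[0.9, 0.1], [0.5, 0.5]]
gamma = 0.9
V_pi = evaluate_policy(R_pi, P_pi, gamma)
```

Since $T^\pi$ is a $\gamma$-contraction in max-norm, the iteration converges geometrically; the returned vector satisfies $V = R^\pi + \gamma P^\pi V$ up to the tolerance.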

In this article, we are interested in learning an approximation of this value function $V^\pi$
under some constraints. First, we assume our approximation to be linearly parameterized:

$$\hat V_\theta(s) = \theta^T \phi(s),$$

with $\theta \in \mathbb{R}^p$ being the parameter vector and $\phi(s) \in \mathbb{R}^p$ the feature vector in state $s$.
Also, we want to estimate the value function $V^\pi$ (or equivalently the associated parameter
$\theta$) from a single finite trajectory generated using a possibly different behavioral policy
$\pi_0$. Let $\mu_0$ be the stationary distribution of the stochastic matrix $P_0 = P^{\pi_0}$ of the
behavior policy $\pi_0$ (we assume it exists and is unique). Let $D_0$ be the diagonal matrix
whose elements are $(\mu_0(s_i))_{1 \le i \le N}$. Let $\Phi$ be the matrix of feature vectors:

$$\Phi = [\phi(1) \dots \phi(N)]^T.$$

The projection $\Pi_0$ onto the hypothesis space spanned by $\Phi$ with respect to the
$\mu_0$-quadratic norm, which will be central for the understanding of the algorithms, has the
following closed form:

$$\Pi_0 = \Phi(\Phi^T D_0 \Phi)^{-1}\Phi^T D_0.$$

If $\pi_0$ is different from $\pi$, it is called an off-policy setting. Notice that all algorithms
considered in this paper use this $\Pi_0$ projection operator, that is the projection according
to the observed data.¹

Standard Algorithms for on-policy Learning without Traces. We now review existing
on-policy linearly parameterized temporal difference learning algorithms (see Table 1).
In this case, the behavior and target policies are the same, so we omit the subscript for the
policy ($\pi$) and the projection ($\Pi$). We assume that a trajectory
$(s_1, a_1, r_1, s_2, \dots, s_j, a_j, r_j, s_{j+1}, \dots, s_n, a_n, r_n, s_{n+1})$ sampled according to the
policy $\pi$ is available, and will explain how to compute the $i$-th iterate for several
algorithms. For all $j \le i$, let us introduce the empirical Bellman operator at step $j$:

$$\hat T_j : V \mapsto r_j + \gamma V(s_{j+1}),$$

so that $\hat T_j V$ is an unbiased estimate of $TV(s_j)$.

Projected fixed point approaches aim at finding the fixed-point of the operator
being the composition of the projection onto the hypothesis space and of the Bellman
operator. In other words, they search for the fixed-point $\hat V_\theta = \Pi T \hat V_\theta$, $\Pi$ being the
projection operator just introduced. Solving the following fixed-point problem:

$$\theta_i = \arg\min_\omega \sum_{j=1}^{i} \left(\hat T_j \hat V_{\theta_i} - \hat V_\omega(s_j)\right)^2$$

with a least-squares approach corresponds to the Least-Squares Temporal Differences (LSTD)
algorithm of Bradtke and Barto (1996). Recently, Sutton et al. (2009) proposed two
algorithms reaching the same objective, Temporal Difference with gradient Correction (TDC)
and Gradient Temporal Difference 2 (GTD2), by performing a stochastic gradient descent
of the function $\theta \mapsto \|\hat V_\theta - \Pi T \hat V_\theta\|^2$, which is minimal (and equal to 0) when
$\hat V_\theta = \Pi T \hat V_\theta$.

A related approach consists in building a recursive algorithm that repeatedly mimics
the iteration $\hat V_{\theta_i} \simeq \Pi T \hat V_{\theta_{i-1}}$. In practice, we aim at minimizing

$$\omega \mapsto \sum_{j=1}^{i} \left(\hat T_j \hat V_{\theta_{i-1}} - \hat V_\omega(s_j)\right)^2.$$

Performing the minimization exactly through a least-squares method leads to the
Least-Squares Policy Evaluation (LSPE) algorithm of Bertsekas and Ioffe (1996). If this
minimization is approximated by a stochastic gradient descent, this leads to the classical
Temporal Difference (TD) algorithm (Sutton and Barto, 1998).



1. It would certainly be interesting to consider the projection according to the stationary distribution of
$\pi$, the (unobserved) target policy: this would reduce off-policy learning to on-policy learning. However,
this would require reweighting samples according to the stationary distribution of the target policy $\pi$,
which is unknown and probably as difficult to estimate as the value function itself. As far as we know,
the only work to move in this direction is the off-policy approach of Kolter (2011): samples are weighted
such that the projection operator composed with the Bellman operator is non-expansive (so, weaker than
finding the projection of the stationary distribution, but offering some guarantees). In this article, we
consider only the $\Pi_0$ projection.



Off-policy learning with traces 



Bootstrapping approaches consist in treating value function approximation after
seeing the $i$-th transition as a supervised learning problem, by replacing the unobserved
values $V^\pi(s_j)$ at states $s_j$ by some estimate computed from the trajectory until the
transition $(s_j, s_{j+1})$, the best such estimate being $\hat T_j \hat V_{\theta_{j-1}}$. This amounts to
minimizing the following function:

$$\omega \mapsto \sum_{j=1}^{i} \left(\hat T_j \hat V_{\theta_{j-1}} - \hat V_\omega(s_j)\right)^2. \qquad (1)$$

Choi and Van Roy (2006) proposed the Fixed-Point Kalman Filter (FPKF), a least-squares
variation of TD that minimizes exactly the function of Equation (1). If the minimization
is approximated by a stochastic gradient descent, this gives, again, the classical TD
algorithm (Sutton and Barto, 1998).



Finally, residual approaches aim at minimizing the distance between the value function
and its image through the Bellman operator, $\|V - TV\|^2_{\mu_0}$. Based on a trajectory, this
suggests the following function to minimize

$$\omega \mapsto \sum_{j=1}^{i} \left(\hat T_j \hat V_\omega - \hat V_\omega(s_j)\right)^2,$$

which is a biased surrogate of the objective $\|V - TV\|^2_{\mu_0}$ (for instance, see Antos et al.
(2006)). This cost function was originally proposed by Baird (1995), who minimized
it using a stochastic gradient approach (this algorithm being referred to here as gBRM, for
gradient-based Bellman Residual Minimization). Both the parametric Gaussian Process
Temporal Differences (GPTD) algorithm of Engel (2005) and the linear Kalman Temporal
Differences (KTD) algorithm of Geist and Pietquin (2010b) can be shown to minimize the
above cost using a least-squares approach, and are thus the very same algorithm,² which we
will refer to as BRM (for Bellman Residual Minimization) in the remainder of this paper.
To sum up, it thus appears that after the $i$-th transition has been observed, the
above-mentioned algorithms behave according to the following pattern:

$$\text{move from } \theta_{i-1} \text{ to } \theta_i \text{ towards the minimum of } \omega \mapsto \sum_{j=1}^{i} \left(\hat T_j \hat V_\xi - \hat V_\omega(s_j)\right)^2, \qquad (2)$$

either through a least-squares approach or a stochastic gradient descent. Each of the
algorithms mentioned above is obtained by substituting $\theta_i$, $\theta_{i-1}$, $\theta_{j-1}$ or $\omega$ for $\xi$.

Towards Off-policy Learning with Traces. It is now easy to preview, at least at a high
level, how one may extend the previously described algorithms so that they can deal with
eligibility traces and off-policy learning. The idea of eligibility traces amounts to looking for
the fixed-point of the following variation of the Bellman operator (Bertsekas and Tsitsiklis,
1996)

$$\forall V \in \mathbb{R}^N, \quad T^\lambda V = (1 - \lambda) \sum_{k=0}^{\infty} \lambda^k T^{k+1} V$$



2. This is only true in the linear case. GPTD and KTD were both introduced in a more general setting:
GPTD is nonparametric and KTD is motivated by the goal of handling nonlinearities.
that makes a geometric average with parameter $\lambda \in (0, 1)$ of the powers of the original
Bellman operator $T$. Clearly, any fixed-point of $T$ is a fixed-point of $T^\lambda$ and vice-versa.
After some simple algebra, one can see that:

$$T^\lambda V = (I - \lambda\gamma P)^{-1}(R + (1 - \lambda)\gamma P V) = V + (I - \lambda\gamma P)^{-1}(R + \gamma P V - V). \qquad (3)$$
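The "simple algebra" behind Equation (3) can be spelled out as follows. This is our own expansion of the step, using only the definitions $TV = R + \gamma P V$ and $T^\lambda V = (1-\lambda)\sum_{k \ge 0} \lambda^k T^{k+1} V$ given above, and the fact that $P$ commutes with $(I - \lambda\gamma P)^{-1}$.

```latex
\begin{align*}
T^{k+1} V &= \sum_{i=0}^{k} (\gamma P)^i R + (\gamma P)^{k+1} V, \\
(1-\lambda)\sum_{k=0}^{\infty} \lambda^k \sum_{i=0}^{k} (\gamma P)^i R
  &= \sum_{i=0}^{\infty} (\gamma P)^i R \,(1-\lambda) \sum_{k=i}^{\infty} \lambda^k
   = \sum_{i=0}^{\infty} (\lambda\gamma P)^i R
   = (I - \lambda\gamma P)^{-1} R, \\
(1-\lambda)\sum_{k=0}^{\infty} \lambda^k (\gamma P)^{k+1} V
  &= (1-\lambda)\,\gamma P \sum_{k=0}^{\infty} (\lambda\gamma P)^k V
   = (I - \lambda\gamma P)^{-1} (1-\lambda)\gamma P V, \\
T^\lambda V &= (I - \lambda\gamma P)^{-1}\bigl(R + (1-\lambda)\gamma P V\bigr) \\
  &= (I - \lambda\gamma P)^{-1}\bigl((I - \lambda\gamma P)V + R + \gamma P V - V\bigr)
   = V + (I - \lambda\gamma P)^{-1}(R + \gamma P V - V).
\end{align*}
```

The last line only adds and subtracts $(I - \lambda\gamma P)V$ inside the bracket, which makes the temporal-difference term $R + \gamma P V - V$ appear explicitly.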



This leads to the following well-known temporal difference-based expression in some state $s$:

$$T^\lambda V(s) = V(s) + E_\pi\left[\sum_{k=i}^{\infty} (\gamma\lambda)^{k-i}\bigl(r_k + \gamma V(s_{k+1}) - V(s_k)\bigr) \,\Big|\, s_i = s\right] = V(s) + \sum_{k=i}^{\infty} (\gamma\lambda)^{k-i}\delta_{ik}(s),$$

where we recall that $E_\pi$ means that the expectation is taken according to the target policy
$\pi$, and where $\delta_{ik}(s) = E_\pi\left[r_k + \gamma V(s_{k+1}) - V(s_k) \mid s_i = s\right]$ is the expected temporal
difference (Sutton and Barto, 1998). With $\lambda = 0$, we recover the Bellman evaluation
equation. With $\lambda = 1$, this is the definition of the value function as the expected and
discounted cumulative reward: $T^1 V(s) = E_\pi\left[\sum_{k=i}^{\infty} \gamma^{k-i} r_k \mid s_i = s\right]$.

As before, we assume that we are given a trajectory
$(s_1, a_1, r_1, s_2, \dots, s_j, a_j, r_j, s_{j+1}, \dots, s_n, a_n, r_n, s_{n+1})$, except now that it may be
generated from some behavior policy possibly different from the target policy $\pi$ of which
we want to estimate the value. We are going to describe how to compute the $i$-th iterate
for several algorithms. For any $i \le k$, unbiased estimates of the temporal difference terms
$\delta_{ik}(s_k)$ can be computed through importance sampling (Ripley, 1987). Indeed, for all
$s, a$, let us introduce the following weight:

$$\rho(s, a) = \frac{\pi(a|s)}{\pi_0(a|s)}.$$

In our trajectory context, for any $j$ and $k$, write

$$\rho_j^k = \prod_{l=j}^{k} \rho_l \quad \text{with} \quad \rho_l = \rho(s_l, a_l),$$

with the convention that if $k < j$, $\rho_j^k = 1$. With these notations,

$$\hat\delta_{ik} = \rho_i^k \hat T_k V - \rho_i^{k-1} V(s_k)$$

is an unbiased estimate of $\delta_{ik}(s_k)$, from which we may build an estimate $\hat T^\lambda_j V$ of
$T^\lambda V(s_j)$ (we will describe this very construction separately for the least-squares and the
stochastic gradient approaches, as they slightly differ). Then, by replacing the empirical
operator $\hat T_j$ in Equation (2) by $\hat T^\lambda_j$, we get the general pattern for off-policy
trace-based algorithms:

$$\text{move from } \theta_{i-1} \text{ to } \theta_i \text{ towards the minimum of } \omega \mapsto \sum_{j=1}^{i} \left(\hat T^\lambda_j \hat V_\xi - \hat V_\omega(s_j)\right)^2, \qquad (4)$$
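The importance-sampling quantities defined above are simple to compute along a trajectory. The sketch below is illustrative only: the dictionary-based policy representation, the function names and the numerical values are our assumptions, not part of the article.

```python
def importance_weight(pi, pi0, s, a):
    """rho(s, a) = pi(a|s) / pi0(a|s); each policy maps state -> {action: probability}."""
    return pi[s][a] / pi0[s][a]


def cumulative_weight(rhos, j, k):
    """rho_j^k = prod_{l=j}^{k} rho_l, with the convention rho_j^k = 1 when k < j.

    rhos is a dict indexed by the (1-based) time step, as in the text."""
    w = 1.0
    for l in range(j, k + 1):
        w *= rhos[l]
    return w


def td_estimate(rhos, i, k, r_k, gamma, V_sk, V_sk_next):
    """hat-delta_{ik} = rho_i^k (r_k + gamma V(s_{k+1})) - rho_i^{k-1} V(s_k)."""
    return (cumulative_weight(rhos, i, k) * (r_k + gamma * V_sk_next)
            - cumulative_weight(rhos, i, k - 1) * V_sk)


# Hypothetical single-state example: target policy pi, behavior policy pi0.
pi = {0: {"a": 0.8, "b": 0.2}}
pi0 = {0: {"a": 0.5, "b": 0.5}}
actions = {1: "a", 2: "b", 3: "a"}          # actions taken at steps 1, 2, 3
rhos = {t: importance_weight(pi, pi0, 0, a) for t, a in actions.items()}
```

Note that the behavior policy must give positive probability to every action the target policy may take, otherwise the ratio is undefined.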
either through a least-squares approach or a stochastic gradient descent after having
instantiated $\xi = \theta_i$, $\theta_{i-1}$, $\theta_{j-1}$ or $\omega$. This process, including in particular the precise
definition of the empirical operator $\hat T^\lambda_j$, will be further developed in the next two
sections.³ Since they are easier to derive, we begin by focusing on least-squares algorithms
(right column of Table 2) in Section 3. Then, Section 4 focuses on stochastic gradient
descent-based algorithms (left column of Table 2).

3. Least-squares-based extensions to eligibility traces and off-policy 
learning 

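In this section, we first consider the case of least-squares solving of the pattern described
in Equation (4). At their $i$-th step, the algorithms that we are about to describe will compute
the parameter $\theta_i$ by exactly solving the following problem:

$$\theta_i = \arg\min_\omega \sum_{j=1}^{i} \left(\hat T^\lambda_{j,i} \hat V_\xi - \hat V_\omega(s_j)\right)^2,$$

where we define the following empirical truncated approximation of $T^\lambda$:

$$\hat T^\lambda_{j,i} : V \mapsto V(s_j) + \sum_{k=j}^{i} (\gamma\lambda)^{k-j} \hat\delta_{jk} = V(s_j) + \sum_{k=j}^{i} (\gamma\lambda)^{k-j}\left(\rho_j^k \hat T_k V - \rho_j^{k-1} V(s_k)\right).$$

Though different definitions of this operator may lead to practical implementations, note
that $\hat T^\lambda_{j,i}$ only uses samples seen before time $i$: this very feature, considered by all existing
works in the literature, will enable us to derive recursive and low-memory algorithms.

Recall that a linear parameterization is chosen here, $\hat V_\xi(s_i) = \xi^T \phi(s_i)$. We adopt the
following notations:

$$\phi_i = \phi(s_i), \quad \Delta\phi_i = \phi_i - \gamma\rho_i\phi_{i+1} \quad \text{and} \quad \tilde\rho_j^{k-1} = (\gamma\lambda)^{k-j}\rho_j^{k-1}.$$

The generic cost function to be solved is therefore:

$$\theta_i = \arg\min_\omega J(\omega; \xi) \quad \text{with} \quad J(\omega; \xi) = \sum_{j=1}^{i}\left(\phi_j^T \xi + \sum_{k=j}^{i} \tilde\rho_j^{k-1}(\rho_k r_k - \Delta\phi_k^T \xi) - \phi_j^T \omega\right)^2. \qquad (5)$$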

Before deriving existing and new least-squares-based algorithms, as announced, some
technical lemmata are required.

The first lemma allows computing directly the inverse of a rank-one perturbed matrix.

Lemma 1 (Sherman-Morrison) Assume that $A$ is an invertible $n \times n$ matrix and that
$u, v \in \mathbb{R}^n$ are two vectors satisfying $1 + v^T A^{-1} u \neq 0$. Then:

$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}.$$
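Lemma 1 is what makes the recursive algorithms of this section quadratic (rather than cubic) per update. The following sketch applies the formula with plain Python lists; the function name and the small numerical example are our own and are only meant as a sanity check of the identity.

```python
def sherman_morrison(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, assuming 1 + v^T A^{-1} u != 0."""
    n = len(u)
    Au = [sum(A_inv[r][c] * u[c] for c in range(n)) for r in range(n)]  # A^{-1} u
    vA = [sum(v[r] * A_inv[r][c] for r in range(n)) for c in range(n)]  # v^T A^{-1}
    denom = 1.0 + sum(v[r] * Au[r] for r in range(n))
    if denom == 0.0:
        raise ValueError("1 + v^T A^{-1} u must be nonzero")
    return [[A_inv[r][c] - Au[r] * vA[c] / denom for c in range(n)] for r in range(n)]


# Illustrative example: A is the 2 x 2 identity, so A + u v^T = [[4, 4], [6, 9]].
A_inv = [[1.0, 0.0], [0.0, 1.0]]
u = [1.0, 2.0]
v = [3.0, 4.0]
B_inv = sherman_morrison(A_inv, u, v)
```

Each call costs $O(n^2)$ operations, compared with $O(n^3)$ for inverting the perturbed matrix from scratch.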



3. Note that we let the empirical operator $\hat T^\lambda_{j,i}$ depend on the index $j$ of the sample (as before) but also
on the step $i$ of the algorithm. This will be particularly useful for the derivation of the recursive and
memory-efficient least-squares-based algorithms that we present in the next section.
The next lemma is simply a rewriting of nested sums. However, it is quite important
here as it will allow stepping from the operator $\hat T^\lambda_{j,i}$ (an operator which depends on the
future of $s_j$, the forward view of eligibility traces) to the recursion over parameters using
eligibility traces (dependence on only past samples, the backward view of eligibility traces);
see Sutton and Barto (1998, Ch. 7) for further discussion on backward/forward views.

Lemma 2 Let $f \in \mathbb{R}^{\mathbb{N} \times \mathbb{N}}$ and $n \in \mathbb{N}$. We have:

$$\sum_{i=1}^{n} \sum_{j=i}^{n} f(i, j) = \sum_{i=1}^{n} \sum_{j=1}^{i} f(j, i).$$
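Both sides of Lemma 2 sum $f(a, b)$ over the same set of pairs $a \le b$; only the order of enumeration changes. A quick numerical check, with hypothetical helper names of our own choosing:

```python
def forward_sum(f, n):
    """Left-hand side: for each i, sum over the 'future' indices j = i..n."""
    return sum(f(i, j) for i in range(1, n + 1) for j in range(i, n + 1))


def backward_sum(f, n):
    """Right-hand side: for each i, sum over the 'past' indices j = 1..i."""
    return sum(f(j, i) for i in range(1, n + 1) for j in range(1, i + 1))


# Illustrative test function with integer values, so both sums are exact.
f = lambda a, b: 10 * a + b
lhs = forward_sum(f, 5)
rhs = backward_sum(f, 5)
```

In the derivations that follow, the outer index of the right-hand side plays the role of the current time step, so that each new sample contributes a term that depends only on past samples.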

We are now ready to mechanically derive the off-policy algorithms LSTD(λ), LSPE(λ),
FPKF(λ) and BRM(λ). This is what we do in the following subsections.

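3.1 Off-policy LSTD(λ)

The off-policy LSTD(λ) algorithm corresponds to instantiating Problem (5) with $\xi = \theta_i$.
This can be solved by zeroing the gradient with respect to $\omega$:

$$\theta_i = \Bigl(\sum_{j=1}^{i} \phi_j\phi_j^T\Bigr)^{-1} \sum_{j=1}^{i} \phi_j\Bigl(\phi_j^T\theta_i + \sum_{k=j}^{i} \tilde\rho_j^{k-1}(\rho_k r_k - \Delta\phi_k^T\theta_i)\Bigr)$$

$$\Leftrightarrow \quad 0 = \sum_{j=1}^{i} \sum_{k=j}^{i} \phi_j \tilde\rho_j^{k-1}(\rho_k r_k - \Delta\phi_k^T\theta_i),$$

which, through Lemma 2, is equivalent to:

$$0 = \sum_{j=1}^{i} \Bigl(\sum_{k=1}^{j} \phi_k \tilde\rho_k^{j-1}\Bigr)(\rho_j r_j - \Delta\phi_j^T\theta_i).$$

Introducing the (importance-based) eligibility vector $z_j$:

$$z_j = \sum_{k=1}^{j} \phi_k \tilde\rho_k^{j-1} = \sum_{k=1}^{j} \phi_k (\gamma\lambda)^{j-k} \prod_{m=k}^{j-1} \rho_m = \gamma\lambda\rho_{j-1} z_{j-1} + \phi_j, \qquad (6)$$

one obtains the following batch estimate:

$$\theta_i = \Bigl(\sum_{j=1}^{i} z_j \Delta\phi_j^T\Bigr)^{-1} \sum_{j=1}^{i} z_j \rho_j r_j = (A_i)^{-1} b_i, \qquad (7)$$

where

$$A_i = \sum_{j=1}^{i} z_j \Delta\phi_j^T \quad \text{and} \quad b_i = \sum_{j=1}^{i} z_j \rho_j r_j. \qquad (8)$$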
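Thanks to Lemma 1, the inverse $M_i = (A_i)^{-1}$ can be computed recursively:

$$M_i = \Bigl(\sum_{j=1}^{i} z_j \Delta\phi_j^T\Bigr)^{-1} = M_{i-1} - \frac{M_{i-1} z_i \Delta\phi_i^T M_{i-1}}{1 + \Delta\phi_i^T M_{i-1} z_i}.$$

This can be used to derive a recursive estimate:

$$\theta_i = \Bigl(M_{i-1} - \frac{M_{i-1} z_i \Delta\phi_i^T M_{i-1}}{1 + \Delta\phi_i^T M_{i-1} z_i}\Bigr)\Bigl(\sum_{j=1}^{i-1} z_j \rho_j r_j + z_i \rho_i r_i\Bigr) = \theta_{i-1} + \frac{M_{i-1} z_i}{1 + \Delta\phi_i^T M_{i-1} z_i}(\rho_i r_i - \Delta\phi_i^T \theta_{i-1}).$$

Writing $K_i$ the gain $\frac{M_{i-1} z_i}{1 + \Delta\phi_i^T M_{i-1} z_i}$, this gives Algorithm 1.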



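Algorithm 1: Off-policy LSTD(λ)

  Initialization;
    Initialize vector $\theta_0$ and matrix $M_0$;
    Set $z_0 = 0$;
  for $i = 1, 2, \dots$ do
    Observe $\phi_i, r_i, \phi_{i+1}$;
    Update traces;
      $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$;
    Update parameters;
      $K_i = \frac{M_{i-1} z_i}{1 + \Delta\phi_i^T M_{i-1} z_i}$;
      $\theta_i = \theta_{i-1} + K_i(\rho_i r_i - \Delta\phi_i^T \theta_{i-1})$;
      $M_i = M_{i-1} - K_i (M_{i-1}^T \Delta\phi_i)^T$;
  end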
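The recursion of Algorithm 1 can be sketched in a few lines of plain Python. The function name, the tuple layout of a transition and the initialization $M_0 = 10^3 I$, $\theta_0 = 0$ are our assumptions (the algorithm leaves the initialization open); the deterministic two-state on-policy chain at the end is a hypothetical example, for which all importance weights equal 1.

```python
def lstd_lambda(transitions, p, gamma, lam, m0_scale=1e3):
    """Recursive off-policy LSTD(lambda) sketch.

    transitions: iterable of (phi_i, r_i, phi_{i+1}, rho_i), features as lists of length p.
    Assumed initialization: theta_0 = 0 and M_0 = m0_scale * I."""
    theta = [0.0] * p
    M = [[m0_scale if a == b else 0.0 for b in range(p)] for a in range(p)]
    z = [0.0] * p
    rho_prev = 1.0  # value is irrelevant at i = 1 because z_0 = 0
    for phi, r, phi_next, rho in transitions:
        # Trace update: z_i = gamma * lambda * rho_{i-1} * z_{i-1} + phi_i
        z = [gamma * lam * rho_prev * zk + fk for zk, fk in zip(z, phi)]
        # Delta phi_i = phi_i - gamma * rho_i * phi_{i+1}
        dphi = [f - gamma * rho * fn for f, fn in zip(phi, phi_next)]
        Mz = [sum(M[a][b] * z[b] for b in range(p)) for a in range(p)]      # M_{i-1} z_i
        dM = [sum(dphi[a] * M[a][b] for a in range(p)) for b in range(p)]  # dphi^T M_{i-1}
        denom = 1.0 + sum(d * m for d, m in zip(dphi, Mz))
        K = [m / denom for m in Mz]                                          # gain K_i
        err = rho * r - sum(d * t for d, t in zip(dphi, theta))
        theta = [t + k * err for t, k in zip(theta, K)]
        M = [[M[a][b] - K[a] * dM[b] for b in range(p)] for a in range(p)]
        rho_prev = rho
    return theta


def alternating_chain(n):
    """Hypothetical deterministic chain 0 -> 1 -> 0 ..., tabular features, on-policy (rho = 1)."""
    feats, rewards = [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0]
    out, s = [], 0
    for _ in range(n):
        out.append((feats[s], rewards[s], feats[1 - s], 1.0))
        s = 1 - s
    return out


theta_hat = lstd_lambda(alternating_chain(200), p=2, gamma=0.5, lam=0.5)
```

On this chain the true values solve $V(0) = 1 + 0.5\,V(1)$ and $V(1) = 0.5\,V(0)$, that is $V = (4/3, 2/3)$; with tabular features the estimate should approach these values, up to the small bias introduced by the assumed initialization of $M_0$.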



This algorithm has been proposed and analyzed recently by Yu (2010a). The author
proves the following result: if the behavior policy $\pi_0$ induces an irreducible Markov chain
and chooses with positive probability any action that may be chosen by the target policy
$\pi$, and if the compound (linear) operator $\Pi_0 T^\lambda$ has a unique fixed-point,⁴ then off-policy
LSTD(λ) converges to it almost surely. Formally, it converges to the solution $\theta^*$ of the
so-called projected fixed-point equation:

$$\hat V_{\theta^*} = \Pi_0 T^\lambda \hat V_{\theta^*}. \qquad (9)$$

Using the expression of the projection $\Pi_0$ and the form of the Bellman operator in
Equation (3), it can be seen that $\theta^*$ satisfies (see Yu (2010a) for details)

$$\theta^* = A^{-1} b$$



4. It is not always the case, see Tsitsiklis and Van Roy (1997) for a counter-example.
where

$$A = \Phi^T D_0 (I - \gamma P)(I - \lambda\gamma P)^{-1}\Phi \quad \text{and} \quad b = \Phi^T D_0 (I - \lambda\gamma P)^{-1} R. \qquad (10)$$

The core of the analysis of Yu (2010a) consists in showing that $\frac{1}{i}A_i$ and $\frac{1}{i}b_i$ defined in
Equation (8) respectively converge to $A$ and $b$ almost surely. Through Equation (7), this
implies the convergence of $\theta_i$ to $\theta^*$.

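3.2 Off-policy LSPE(λ)

The off-policy LSPE(λ) algorithm corresponds to the instantiation $\xi = \theta_{i-1}$ in Problem (5):

$$\theta_i = \arg\min_\omega \sum_{j=1}^{i} \Bigl(\phi_j^T\theta_{i-1} + \sum_{k=j}^{i} \tilde\rho_j^{k-1}(\rho_k r_k - \Delta\phi_k^T\theta_{i-1}) - \phi_j^T\omega\Bigr)^2.$$

This can be solved by zeroing the gradient with respect to $\omega$:

$$\theta_i = \Bigl(\sum_{j=1}^{i} \phi_j\phi_j^T\Bigr)^{-1} \sum_{j=1}^{i} \phi_j\Bigl(\phi_j^T\theta_{i-1} + \sum_{k=j}^{i} \tilde\rho_j^{k-1}(\rho_k r_k - \Delta\phi_k^T\theta_{i-1})\Bigr)$$

$$= \theta_{i-1} + \Bigl(\sum_{j=1}^{i} \phi_j\phi_j^T\Bigr)^{-1} \sum_{j=1}^{i} \sum_{k=j}^{i} \phi_j \tilde\rho_j^{k-1}(\rho_k r_k - \Delta\phi_k^T\theta_{i-1}).$$

Lemma 2 can be used (recall the definition of the eligibility vector $z_j$ in Equation (6)):

$$\theta_i = \theta_{i-1} + \Bigl(\sum_{j=1}^{i} \phi_j\phi_j^T\Bigr)^{-1} \sum_{j=1}^{i} \sum_{k=1}^{j} \phi_k \tilde\rho_k^{j-1}(\rho_j r_j - \Delta\phi_j^T\theta_{i-1})$$

$$= \theta_{i-1} + \Bigl(\sum_{j=1}^{i} \phi_j\phi_j^T\Bigr)^{-1} \sum_{j=1}^{i} z_j(\rho_j r_j - \Delta\phi_j^T\theta_{i-1}).$$

Define the matrix $N_i$ as follows:

$$N_i = \Bigl(\sum_{j=1}^{i} \phi_j\phi_j^T\Bigr)^{-1} = N_{i-1} - \frac{N_{i-1}\phi_i\phi_i^T N_{i-1}}{1 + \phi_i^T N_{i-1}\phi_i}, \qquad (11)$$

where the second equality follows from Lemma 1. Let $A_i$ and $b_i$ be defined as in the LSTD
description in Equation (8). For clarity, we restate their definition along with their recursive
writing:

$$A_i = \sum_{j=1}^{i} z_j \Delta\phi_j^T = A_{i-1} + z_i \Delta\phi_i^T,$$

$$b_i = \sum_{j=1}^{i} z_j \rho_j r_j = b_{i-1} + z_i \rho_i r_i.$$

Then, it can be seen that the LSPE(λ) update is:

$$\theta_i = \theta_{i-1} + N_i(b_i - A_i\theta_{i-1}).$$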

Algorithm 2: Off-policy LSPE(λ)

  Initialization;
    Initialize vector $\theta_0$ and matrix $N_0$;
    Set $z_0 = 0$, $A_0 = 0$ and $b_0 = 0$;
  for $i = 1, 2, \dots$ do
    Observe $\phi_i, r_i, \phi_{i+1}$;
    Update traces;
      $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$;
    Update parameters;
      $N_i = N_{i-1} - \frac{N_{i-1}\phi_i\phi_i^T N_{i-1}}{1 + \phi_i^T N_{i-1}\phi_i}$;
      $A_i = A_{i-1} + z_i \Delta\phi_i^T$;
      $b_i = b_{i-1} + \rho_i z_i r_i$;
      $\theta_i = \theta_{i-1} + N_i(b_i - A_i\theta_{i-1})$;
  end
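For comparison with LSTD(λ), here is a minimal plain-Python sketch of the LSPE(λ) recursion of Algorithm 2. The function name, the transition layout and the initialization $N_0 = I$, $\theta_0 = 0$ are our assumptions; the alternating two-state chain is a hypothetical on-policy example (all importance weights equal 1).

```python
def lspe_lambda(transitions, p, gamma, lam, n0_scale=1.0):
    """Recursive off-policy LSPE(lambda) sketch.

    transitions: iterable of (phi_i, r_i, phi_{i+1}, rho_i), features as lists of length p.
    Assumed initialization: theta_0 = 0, N_0 = n0_scale * I, A_0 = 0, b_0 = 0."""
    theta = [0.0] * p
    N = [[n0_scale if a == c else 0.0 for c in range(p)] for a in range(p)]
    A = [[0.0] * p for _ in range(p)]
    b_vec = [0.0] * p
    z = [0.0] * p
    rho_prev = 1.0  # value is irrelevant at i = 1 because z_0 = 0
    for phi, r, phi_next, rho in transitions:
        z = [gamma * lam * rho_prev * zk + fk for zk, fk in zip(z, phi)]
        dphi = [f - gamma * rho * fn for f, fn in zip(phi, phi_next)]
        # Sherman-Morrison update of N_i (Equation (11))
        Nphi = [sum(N[a][c] * phi[c] for c in range(p)) for a in range(p)]
        phiN = [sum(phi[a] * N[a][c] for a in range(p)) for c in range(p)]
        denom = 1.0 + sum(f * x for f, x in zip(phi, Nphi))
        N = [[N[a][c] - Nphi[a] * phiN[c] / denom for c in range(p)] for a in range(p)]
        # Accumulate A_i and b_i
        A = [[A[a][c] + z[a] * dphi[c] for c in range(p)] for a in range(p)]
        b_vec = [b_vec[a] + rho * z[a] * r for a in range(p)]
        # theta_i = theta_{i-1} + N_i (b_i - A_i theta_{i-1})
        resid = [b_vec[a] - sum(A[a][c] * theta[c] for c in range(p)) for a in range(p)]
        theta = [theta[a] + sum(N[a][c] * resid[c] for c in range(p)) for a in range(p)]
        rho_prev = rho
    return theta


def alternating_chain(n):
    """Hypothetical deterministic chain 0 -> 1 -> 0 ..., tabular features, on-policy (rho = 1)."""
    feats, rewards = [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0]
    out, s = [], 0
    for _ in range(n):
        out.append((feats[s], rewards[s], feats[1 - s], 1.0))
        s = 1 - s
    return out


theta_lspe = lspe_lambda(alternating_chain(200), p=2, gamma=0.5, lam=0.5)
```

Unlike the LSTD(λ) recursion, the parameter update here reuses the previous iterate at every step, so convergence depends on the iteration $\theta \mapsto \theta + N_i(b_i - A_i\theta)$ being contracting; on this on-policy example it is.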



The overall computation is provided in Algorithm 2.

This algorithm, (briefly) mentioned by Yu (2010a), generalizes the LSPE(λ) algorithm
of Bertsekas and Ioffe (1996) to off-policy learning. With respect to LSTD(λ), which
computes $\theta_i = (A_i)^{-1} b_i$ (cf. Equation (7)) at each iteration, LSPE(λ) is fundamentally recursive
(as it is based on an iterated fixed-point search). Along with the almost sure convergence
of $\frac{1}{i}A_i$ and $\frac{1}{i}b_i$ to $A$ and $b$ (defined in Equation (10)), it can be shown that $iN_i$ converges
to $N = (\Phi^T D_0 \Phi)^{-1}$ (see for instance Nedic and Bertsekas (2003)) so that, asymptotically,
LSPE(λ) behaves as:

$$\theta_i = \theta_{i-1} + N(b - A\theta_{i-1}) = Nb + (I - NA)\theta_{i-1}$$

or, using the definition of $\Pi_0$, $A$, $b$ (Equation (10)) and $T^\lambda$ (Equation (3)):

$$\hat V_{\theta_i} = \Phi\theta_i = \Phi N b + \Phi(I - NA)\theta_{i-1} = \Pi_0 T^\lambda \hat V_{\theta_{i-1}}. \qquad (12)$$



The behavior of this sequence depends on whether the spectral radius of $\Pi_0 T^\lambda$ is smaller
than 1 or not. Thus, the analyses of Yu (2010a) and Nedic and Bertsekas (2003) (for the
convergence of $N_i$) imply the following convergence result: under the assumptions required
for the convergence of off-policy LSTD(λ), and the additional assumption that the operator
$\Pi_0 T^\lambda$ has a spectral radius smaller than 1 (so that it is contracting), LSPE(λ) also converges
almost surely to the fixed-point of the compound $\Pi_0 T^\lambda$ operator.

There are two sufficient conditions that can ensure such a desired contraction property.
The first one is when one considers on-policy learning, as Nedic and Bertsekas (2003) did
when they derived the first convergence proof of (on-policy) LSPE(λ). When the behavior
policy $\pi_0$ is different from the target policy $\pi$, a sufficient condition for contraction is that
$\lambda$ be close enough to 1; indeed, when $\lambda$ tends to 1, the spectral radius of $T^\lambda$ tends to
zero and can potentially balance an expansion of the projection $\Pi_0$. In the off-policy case,
when $\gamma$ is sufficiently big, a small value of $\lambda$ can make $\Pi_0 T^\lambda$ expansive (see Tsitsiklis and
Van Roy (1997) for an example in the case $\lambda = 0$) and off-policy LSPE(λ) will then diverge.
Finally, Equations (9) and (12) show that when $\lambda = 1$, both LSTD(λ) and LSPE(λ)
asymptotically coincide (as $T^1 V$ does not depend on $V$).



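3.3 Off-policy FPKF(λ)

The off-policy FPKF(λ) algorithm corresponds to the instantiation $\xi = \theta_{j-1}$ in Problem (5):

$$\theta_i = \arg\min_\omega \sum_{j=1}^{i} \Bigl(\phi_j^T\theta_{j-1} + \sum_{k=j}^{i} \tilde\rho_j^{k-1}(\rho_k r_k - \Delta\phi_k^T\theta_{j-1}) - \phi_j^T\omega\Bigr)^2.$$

This can be solved by zeroing the gradient with respect to $\omega$:

$$\theta_i = N_i \sum_{j=1}^{i} \phi_j\Bigl(\phi_j^T\theta_{j-1} + \sum_{k=j}^{i} \tilde\rho_j^{k-1}(\rho_k r_k - \Delta\phi_k^T\theta_{j-1})\Bigr),$$

where $N_i$ is the matrix introduced for LSPE(λ) in Equation (11). For clarity, we restate its
definition here and its recursive writing:

$$N_i = \Bigl(\sum_{j=1}^{i} \phi_j\phi_j^T\Bigr)^{-1} = N_{i-1} - \frac{N_{i-1}\phi_i\phi_i^T N_{i-1}}{1 + \phi_i^T N_{i-1}\phi_i}. \qquad (13)$$

Using Lemma 2, one obtains:

$$\theta_i = N_i\Bigl(\sum_{j=1}^{i} \phi_j\phi_j^T\theta_{j-1} + \sum_{j=1}^{i}\sum_{k=1}^{j} \phi_k \tilde\rho_k^{j-1}(\rho_j r_j - \Delta\phi_j^T\theta_{k-1})\Bigr).$$

With respect to the previously described algorithms, the difficulty here is that on the
right side there is a dependence on all the previous terms $\theta_{k-1}$ for $1 \le k \le i$. Using
the symmetry of the dot product, $\Delta\phi_j^T\theta_{k-1} = \theta_{k-1}^T\Delta\phi_j$, it is possible to write a recursive
algorithm by introducing the trace matrix $Z_j$ that integrates the subsequent values of $\theta_k$ as
follows:

$$Z_j = \sum_{k=1}^{j} \tilde\rho_k^{j-1}\phi_k\theta_{k-1}^T = \gamma\lambda\rho_{j-1} Z_{j-1} + \phi_j\theta_{j-1}^T.$$

With this notation we obtain:

$$\theta_i = N_i\Bigl(\sum_{j=1}^{i} (\phi_j\phi_j^T\theta_{j-1} + z_j\rho_j r_j - Z_j\Delta\phi_j)\Bigr).$$

Using Equation (13) and a few algebraic manipulations, we end up with:

$$\theta_i = \theta_{i-1} + N_i(z_i\rho_i r_i - Z_i\Delta\phi_i).$$

This is the parameter update as provided in Algorithm 3.

It generalizes the FPKF algorithm of Choi and Van Roy (2006), which was originally
only introduced without traces and in the on-policy case. Like LSPE(λ), this algorithm is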
Algorithm 3: Off-policy FPKF(λ)

  Initialization;
    Initialize vector $\theta_0$ and matrix $N_0$;
    Set $z_0 = 0$ and $Z_0 = 0$;
  for $i = 1, 2, \dots$ do
    Observe $\phi_i, r_i, \phi_{i+1}$;
    Update traces;
      $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$;
      $Z_i = \gamma\lambda\rho_{i-1} Z_{i-1} + \phi_i\theta_{i-1}^T$;
    Update parameters;
      $N_i = N_{i-1} - \frac{N_{i-1}\phi_i\phi_i^T N_{i-1}}{1 + \phi_i^T N_{i-1}\phi_i}$;
      $\theta_i = \theta_{i-1} + N_i(z_i\rho_i r_i - Z_i\Delta\phi_i)$;
  end



fundamentally recursive. However, its overall behavior is quite different. As we discussed
for LSPE(λ), $iN_i$ can be shown to tend asymptotically to $N = (\Phi^T D_0 \Phi)^{-1}$ and FPKF(λ)
iterates eventually resemble:

$$\theta_i = \theta_{i-1} + \frac{1}{i} N(z_i\rho_i r_i - Z_i\Delta\phi_i).$$

The term in brackets is a random component (that only depends on the last transition) and
$\frac{1}{i}$ acts as a learning coefficient that asymptotically tends to 0. In other words, FPKF(λ)
has a stochastic approximation flavour. In particular, one can see FPKF(0) as a stochastic
approximation of LSPE(0). Indeed, asymptotically, FPKF(0) does the following update

$$\theta_i = \theta_{i-1} + \frac{1}{i} N(\rho_i\phi_i r_i - \phi_i\Delta\phi_i^T\theta_{i-1}),$$

and one can notice that $\rho_i\phi_i r_i$ and $\phi_i\Delta\phi_i^T$ are samples of $b$ and $A$, respectively, to which
$b_i$ and $A_i$ converge (up to the $\frac{1}{i}$ normalization) in LSPE(0). When $\lambda > 0$, the situation is
less clear, up to the fact that since $T^1 V$ does not depend on $V$, we expect FPKF to
asymptotically behave like LSTD and LSPE when $\lambda$ tends to 1.

Due to its much more involved form (notably the trace matrix $Z_j$ integrating the values
of all the $\theta_k$ from the start), it does not seem easy to provide a guarantee for FPKF(λ),
even in the on-policy case. To our knowledge, there does not exist any proof of convergence
for stochastic approximation algorithms in the off-policy case with traces,⁵ and a related
result for FPKF(λ) thus seems difficult. Based on the above-mentioned relation between
FPKF(0) and LSPE(0) and the experiments we have run (see Section 5), we conjecture that
off-policy FPKF(λ) has the same asymptotic behavior as LSPE(λ). We leave the formal
study of this algorithm for future work.



5. An analysis of TD(λ), with a simplifying assumption that forces the algorithm to stay bounded, is given
by Yu (2010a). An analysis of GQ(λ) is provided by Maei and Sutton (2010), with an assumption on
the second moment of the traces, which does not hold in general (see Proposition 2 in (Yu, 2010a)). A
full analysis of these algorithms thus remains to be done. See also Sections 4.1 and 4.2.
3.4 Off-policy BRM(λ)

The off-policy BRM(λ) algorithm corresponds to the instantiation $\xi = \omega$ in Problem (5):

$$\theta_i = \arg\min_\omega \sum_{j=1}^{i}\Bigl(\phi_j^T\omega + \sum_{k=j}^{i} \tilde\rho_j^{k-1}(\rho_k r_k - \Delta\phi_k^T\omega) - \phi_j^T\omega\Bigr)^2 = \arg\min_\omega \sum_{j=1}^{i}\Bigl(\sum_{k=j}^{i} \tilde\rho_j^{k-1}(\rho_k r_k - \Delta\phi_k^T\omega)\Bigr)^2.$$

Define

$$\psi_{j \to i} = \sum_{k=j}^{i} \tilde\rho_j^{k-1}\Delta\phi_k \quad \text{and} \quad z_{j \to i} = \sum_{k=j}^{i} \tilde\rho_j^{k-1}\rho_k r_k.$$

This yields the following batch estimate:

$$\theta_i = \arg\min_\omega \sum_{j=1}^{i} (z_{j \to i} - \psi_{j \to i}^T\omega)^2 = (\tilde A_i)^{-1}\tilde b_i, \qquad (14)$$

where

$$\tilde A_i = \sum_{j=1}^{i} \psi_{j \to i}\psi_{j \to i}^T \quad \text{and} \quad \tilde b_i = \sum_{j=1}^{i} \psi_{j \to i} z_{j \to i}.$$

The transformation of this batch estimate into a recursive update rule is somewhat tedious
(it involves three "trace" variables), and the details are deferred to Appendix A for clarity.
The resulting BRM(λ) method is provided in Algorithm 4. Note that at each step, this
algorithm involves the inversion of a $2 \times 2$ matrix (involving the $2 \times 2$ identity matrix $I_2$),
an inversion that admits a straightforward analytical solution. The computational complexity
of an iteration of BRM(λ) is thus $O(p^2)$ (as for the preceding least-squares-based algorithms).

GPTD and KTD, which are close to BRM, have also been extended with some trace
mechanism; however, GPTD(λ) (Engel, 2005),⁶ KTD(λ) (Geist and Pietquin, 2010a) and
the just described BRM(λ) are different algorithms. Briefly, GPTD(λ) is very close to
LSTD(λ) and KTD(λ) uses a different Bellman operator.⁷ As BRM(λ) builds a linear
system whose solution is updated recursively, it resembles LSTD(λ). However, the system
it builds is different. The following theorem, proved in Appendix B, partially characterizes
the behavior of BRM(λ) and its potential limit.⁸

Theorem 3 Assume that the stochastic matrix $P_0$ of the behavior policy is irreducible and
has stationary distribution $\mu_0$. Further assume that there exists a coefficient $\beta < 1$ such
that

$$\forall (s, a), \quad \lambda\gamma\rho(s, a) \le \beta, \qquad (15)$$



6. Technically, GPTD(λ) is not exactly a generalization of GPTD, as it does not reduce to it when $\lambda = 0$.
It is rather a variation.

7. The corresponding loss is $\bigl(\hat T^0_{j,i}\hat V(\omega) - \hat V_\omega(s_j) + \gamma\lambda(\hat T^1_{j+1,i}\hat V(\omega) - \hat V_\omega(s_{j+1}))\bigr)^2$. With $\lambda = 0$ it gives $\hat T^0_{j,i}$
and with $\lambda = 1$ it provides $\hat T^1_{j,i}$.

8. Our proof is similar to that of Proposition 4 in Bertsekas and Yu (2009b). The overall arguments are
the following: Equation (15) implies that the traces can be truncated at some depth $l$, whose influence
on the potential limit of the algorithm vanishes when $l$ tends to $\infty$. For all $l$, the $l$-truncated version of
the algorithm can easily be analyzed through the ergodic theorem for Markov chains. Making $l$ tend to
$\infty$ allows tying the convergence of the original algorithm to that of the truncated version. Finally,
the formula for the limit of the truncated algorithm is computed and one derives the limit.
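Algorithm 4: Off-policy BRM(λ)

  Initialization;
    Initialize vector $\theta_0$ and matrix $C_0$;
    Set $y_0 = 0$, $\Delta_0 = 0$ and $z_0 = 0$;
  for $i = 1, 2, \dots$ do
    Observe $\phi_i, r_i, \phi_{i+1}$;
    Pre-update traces;
      $y_i = (\gamma\lambda\rho_{i-1})^2 y_{i-1} + 1$;
    Compute;
      $U_i = \Bigl[\sqrt{y_i}\,\Delta\phi_i + \frac{\gamma\lambda\rho_{i-1}}{\sqrt{y_i}}\Delta_{i-1} \quad \frac{\gamma\lambda\rho_{i-1}}{\sqrt{y_i}}\Delta_{i-1}\Bigr]$;
      $V_i = \Bigl[\sqrt{y_i}\,\Delta\phi_i + \frac{\gamma\lambda\rho_{i-1}}{\sqrt{y_i}}\Delta_{i-1} \quad -\frac{\gamma\lambda\rho_{i-1}}{\sqrt{y_i}}\Delta_{i-1}\Bigr]^T$;
      $W_i = \Bigl[\sqrt{y_i}\,\rho_i r_i + \frac{\gamma\lambda\rho_{i-1}}{\sqrt{y_i}} z_{i-1} \quad -\frac{\gamma\lambda\rho_{i-1}}{\sqrt{y_i}} z_{i-1}\Bigr]^T$;
    Update parameters;
      $\theta_i = \theta_{i-1} + C_{i-1}U_i (I_2 + V_i C_{i-1} U_i)^{-1}(W_i - V_i\theta_{i-1})$;
      $C_i = C_{i-1} - C_{i-1}U_i (I_2 + V_i C_{i-1} U_i)^{-1} V_i C_{i-1}$;
    Post-update traces;
      $\Delta_i = (\gamma\lambda\rho_{i-1})\Delta_{i-1} + \Delta\phi_i y_i$;
      $z_i = (\gamma\lambda\rho_{i-1}) z_{i-1} + r_i\rho_i y_i$;
  end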



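then $\frac{1}{i}\tilde A_i$ and $\frac{1}{i}\tilde b_i$ respectively converge almost surely to

$$\tilde A = \Phi^T \left[ D - \gamma DP - \gamma P^T D + \gamma^2 D' + S(I - \gamma P) + (I - \gamma P^T)S^T \right] \Phi,$$

$$\tilde b = \Phi^T \left[ (I - \gamma P^T)Q^T D + S \right] R^\pi,$$

where we wrote:

$$D = \operatorname{diag}\left( (I - (\lambda\gamma)^2 \tilde P^T)^{-1} \mu_0 \right),$$

$$D' = \operatorname{diag}\left( \tilde P^T (I - (\lambda\gamma)^2 \tilde P^T)^{-1} \mu_0 \right),$$

$$S = \lambda\gamma\,(DP - \gamma D')\,Q,$$

and where $\tilde P$ is the matrix whose coordinates are $\tilde p_{ss'} = \sum_a \pi(a|s)\rho(s,a)P(s'|s,a)$. Then,
the BRM($\lambda$) algorithm converges with probability 1 to $\tilde A^{-1}\tilde b$.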



The assumption given by Equation (15) trivially holds in the on-policy case (in which
$\rho(s,a) = 1$ for all $(s,a)$) and in the off-policy case when $\lambda\gamma$ is sufficiently small with respect
to the mismatch between policies. Note in particular that this result implies the almost sure
convergence of the GPTD/KTD algorithms in the on-policy and no-trace case, a question
that was still open in the literature (see for instance the conclusion of Engel (2005)). The
matrix $\tilde P$, which is in general not a stochastic matrix, can have a spectral radius bigger
than 1; Equation (15) ensures that $(\lambda\gamma)^2\tilde P$ has a spectral radius smaller than $\beta$ so that
$D$ and $D'$ are well defined. Removing the assumption of Equation (15) does not seem easy,
since by tuning $\lambda\gamma$ maliciously, one may force the spectral radius of $(\lambda\gamma)^2\tilde P$ to be as close
to 1 as one may want, which would make $\tilde A$ and $\tilde b$ diverge. Though the quantity $\tilde A^{-1}\tilde b$
may compensate for these divergences, our current proof technique cannot account for this
situation and a related analysis constitutes possible future work.
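As an illustration of the assumption of Equation (15), the following minimal sketch computes the largest value of $\lambda\gamma\rho(s,a)$ for a pair of tabular policies. The function names and the list-of-lists policy layout are assumptions made for this example only.

```python
def max_trace_factor(pi, pi0, lam, gamma):
    """Largest value of lam * gamma * rho(s, a), with rho(s, a) = pi(a|s) / pi0(a|s).

    pi and pi0 are lists indexed by state; each entry is a list of action
    probabilities (a hypothetical layout chosen for this sketch).
    """
    worst = 0.0
    for target_row, behavior_row in zip(pi, pi0):
        for p_target, p_behavior in zip(target_row, behavior_row):
            if p_behavior == 0.0:
                if p_target > 0.0:
                    raise ValueError("behavior policy never takes an action the target policy uses")
                continue  # rho is never needed for this pair
            worst = max(worst, lam * gamma * p_target / p_behavior)
    return worst


def satisfies_equation_15(pi, pi0, lam, gamma):
    """True when some beta < 1 bounds lam * gamma * rho(s, a) for all (s, a)."""
    return max_trace_factor(pi, pi0, lam, gamma) < 1.0
```

In the on-policy case the largest factor is simply $\lambda\gamma$, so the check passes whenever $\lambda\gamma < 1$; with a uniform behavior policy over two actions and a deterministic target policy, $\rho$ can reach 2 and the check fails as soon as $\lambda\gamma \ge 1/2$.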

The fundamental idea behind the Bellman Residual approach is to address the computation
of the fixed-point of $T^\lambda$ differently from the previous methods. Instead of computing
the projected fixed-point, one considers the following overdetermined system (the second
line uses the definition of $T^\lambda$):

$$\begin{aligned}
& \Phi\theta \simeq T^\lambda \Phi\theta \\
\Leftrightarrow\ & \Phi\theta \simeq (I - \lambda\gamma P)^{-1}\left(R + (1-\lambda)\gamma P\Phi\theta\right) \\
\Leftrightarrow\ & \Phi\theta \simeq QR + (1-\lambda)\gamma PQ\Phi\theta \\
\Leftrightarrow\ & \Psi\theta \simeq QR
\end{aligned}$$

with $\Psi = \Phi - (1-\lambda)\gamma PQ\Phi$, and solves it in a least-squares sense, that is by computing
$\theta^* = \bar A^{-1}\bar b$ with $\bar A = \Psi^T\Psi$ and $\bar b = \Psi^T QR$. One of the motivations for this approach is
that, as opposed to the matrix $A$ of LSTD/LSPE/FPKF, $\bar A$ is invertible for all values of $\lambda$,
and one can always guarantee a finite error bound with respect to the best projection (see
Schoknecht (2002); Yu and Bertsekas (2008); Scherrer (2010)). While the goal of BRM($\lambda$) is
to compute $\bar A$ and $\bar b$ from samples, what it actually computes ($\tilde A$ and $\tilde b$ as characterized in
Theorem 3) will in general be biased because the estimation is based on a single trajectory (see footnote 9).
Such a bias adds an uncontrolled variance term to $\bar A$ and $\bar b$ (for instance, see Antos et al.
(2006)); an interesting consequence is that $\tilde A$ is always non-singular (see footnote 10). More precisely,
there are two sources of bias in the estimation: one results from the non-Monte-Carlo
evaluation (the fact that $\lambda < 1$) and the other from the use of the correlated importance
sampling factors (as soon as one considers off-policy learning). The interested reader may
check that in the on-policy case, and when $\lambda$ tends to 1, $\tilde A$ and $\tilde b$ coincide with $\bar A$ and $\bar b$.
However, in the strictly off-policy case, taking $\lambda = 1$ does not prevent the bias due to the
correlated importance sampling factors. While we have argued that LSTD/LSPE/FPKF should
asymptotically coincide when $\lambda = 1$, we see here that BRM should generally differ in an
off-policy situation.

4. Stochastic gradient based extensions to eligibility traces and off-policy 
learning 

We have just provided a systematic derivation of all least-squares-based algorithms for
learning with eligibility traces in an off-policy manner. When the number of features $p$ is
very large, the $O(p^2)$ complexity involved by a least-squares approach may be prohibitive.
In such a situation, a natural alternative is to consider an approach based on a stochastic
gradient descent of the objective function of interest (Bottou and Bousquet, 2011; Sutton
et al., 2009; Maei and Sutton, 2010).



9. It is possible to remove the bias when $\lambda = 0$ by using double samples. However, in the case where $\lambda > 0$,
removing the bias seems much more difficult.
10. $\bar A$ is by construction positive definite, and $\tilde A$ equals $\bar A$ plus a positive term (the variance term), and is
thus also positive definite.

In this section, we will describe a systematic derivation of stochastic gradient based
algorithms for learning in an off-policy manner with eligibility traces. The principle followed
is the same as for the least-squares-based approaches: we shall instantiate the generic algorithmic
pattern by choosing the value of $\xi$ and update the parameter so as to move
towards the minimum of $J(\theta_i, \xi)$ using a stochastic gradient descent. To
make this pattern precise, we need to define the empirical approximate
operator we use. We will consider the untruncated $\hat T^\lambda_{i,n}$ operators (written in the following
$\hat T^\lambda_i$, with a slight abuse of notation):

$$\hat T^\lambda_i V = V(s_i) + \sum_{j=i}^{n} (\gamma\lambda)^{j-i} \left( \rho_i^j\, \hat T_j V - \rho_i^{j-1} V(s_j) \right) \tag{16}$$

where $n$ is the total length of the trajectory.

It should be noted that algorithmic derivations will here be a little bit more involved
than in the least-squares case. First, with the instantiation $\xi = \theta_i$, the generic
pattern is actually a fixed-point problem onto which one cannot directly perform a
stochastic gradient descent (this issue will be addressed in Section 4.2 through the
introduction of an auxiliary objective function, following the approach originally proposed by
Sutton et al. (2009)). A second difficulty is the following: the just introduced empirical
operator $\hat T^\lambda_i$ depends on all the trajectory after step $i$ (on the future of the process), and
is for this reason usually called a forward view estimate. Though it would be possible, in
principle, to implement a gradient descent based on this forward view, it would be efficient
in neither memory nor time. Thus, we will follow a usual trick of the literature by deriving
recursive algorithms based on a backward view estimate that is equivalent to the forward
view in expectation. To do so, we will repeatedly use the following identity, which highlights
the fact that the estimate $\hat T^\lambda_i V$ can be written as a forward recursion:

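Lemma 4 Let $\hat T^\lambda_i$ be the operator defined in Equation (16) and let $V \in \mathbb{R}^{S}$. We have

$$\hat T^\lambda_i V = \rho_i r_i + \gamma\rho_i(1-\lambda)V(s_{i+1}) + \gamma\lambda\rho_i\, \hat T^\lambda_{i+1} V.$$

Proof Using notably the identity $\rho_i^j = \rho_i\rho_{i+1}^j$, we have:

$$\begin{aligned}
\hat T^\lambda_i V &= V(s_i) + \sum_{j=i}^{n} (\gamma\lambda)^{j-i} \left( \rho_i^j\, \hat T_j V - \rho_i^{j-1} V(s_j) \right) \\
&= V(s_i) + \rho_i \hat T_i V - V(s_i) + \gamma\lambda\rho_i \sum_{j=i+1}^{n} (\gamma\lambda)^{j-i-1} \left( \rho_{i+1}^j\, \hat T_j V - \rho_{i+1}^{j-1} V(s_j) \right) \\
&= \rho_i \hat T_i V + \gamma\lambda\rho_i \left( \hat T^\lambda_{i+1} V - V(s_{i+1}) \right).
\end{aligned}$$

The result follows since $\hat T_i V = r_i + \gamma V(s_{i+1})$. $\blacksquare$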
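The identity of Lemma 4 can be checked numerically on a short trajectory. The sketch below is only an illustration: the function names, the 0-based indexing and the convention that the operator evaluated past the end of the trajectory returns the value of the last state (the empty-sum case of Equation (16)) are assumptions of this example.

```python
def forward_lambda_return(i, rewards, values, rhos, gamma, lam):
    """Direct evaluation of Equation (16) at step i (0-based indices).

    rewards[j] and rhos[j] belong to transition j; values[j] is V(s_j), so
    values has one more entry than rewards (the value of the final state).
    """
    n = len(rewards)
    total = values[i]
    prod_before = 1.0  # rho_i^{j-1}, an empty product when j == i
    for j in range(i, n):
        prod_through = prod_before * rhos[j]            # rho_i^{j}
        one_step = rewards[j] + gamma * values[j + 1]   # one-step operator applied to V
        total += (gamma * lam) ** (j - i) * (prod_through * one_step - prod_before * values[j])
        prod_before = prod_through
    return total


def recursive_lambda_return(i, rewards, values, rhos, gamma, lam):
    """Same quantity obtained by unrolling the recursion of Lemma 4 from the end."""
    n = len(rewards)
    nxt = values[n]  # assumed boundary value: empty sum in Equation (16)
    for j in range(n - 1, i - 1, -1):
        nxt = (rhos[j] * rewards[j]
               + gamma * rhos[j] * (1.0 - lam) * values[j + 1]
               + gamma * lam * rhos[j] * nxt)
    return nxt
```

Both functions return the same number for every starting index, which is exactly what the lemma states; the recursive form needs a single backward pass over the trajectory.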

To sum up, the "recipe" that we are about to use to derive off-policy gradient learning
algorithms based on eligibility traces will consist of the following steps:

1. write the empirical generic cost function with the untruncated Bellman operator
of Equation (16);

2. instantiate $\xi$ and derive the gradient-based update rule (with some additional work
for $\xi = \theta_i$, see Section 4.2);

3. turn the forward view into an equivalent (in expectation) backward view.

The next subsections detail the precise derivation of the algorithms.

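4.1 Off-policy TD($\lambda$)

Because it is the simplest, we begin by considering the bootstrap approach, that is the
instantiation $\xi = \theta_{j-1}$. The cost function to be minimized is therefore:

$$\sum_{j=1}^{i} \left( \hat T^\lambda_j \hat V_{\theta_{j-1}} - \hat V_\omega(s_j) \right)^2.$$

Minimized with a stochastic gradient descent, the related update rule is ($\alpha_i$ being a standard
learning rate and recalling that $\hat V_\omega(s_i) = \omega^T\phi(s_i) = \omega^T\phi_i$):

$$\theta_i = \theta_{i-1} - \frac{\alpha_i}{2} \nabla_\omega \left( \hat T^\lambda_i \hat V_{\theta_{i-1}} - \hat V_\omega(s_i) \right)^2 \Big|_{\omega = \theta_{i-1}} = \theta_{i-1} + \alpha_i \phi_i \left( \hat T^\lambda_i \hat V_{\theta_{i-1}} - \hat V_{\theta_{i-1}}(s_i) \right). \tag{17}$$

At this point, one could notice that the exact same update rule would have been obtained
with the instantiation $\xi = \theta_{i-1}$. This was to be expected: as only the last term of the sum
is considered for the update, we have $j = i$, therefore $\xi = \theta_{i-1} = \theta_{j-1}$.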



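Equation (17) makes use of a $\lambda$-TD error defined as

$$\delta^\lambda_i(\omega) = \hat T^\lambda_i \hat V_\omega - \hat V_\omega(s_i).$$

For convenience, let also $\delta_i$ be the standard (off-policy) TD error defined as

$$\delta_i(\omega) = \delta^{\lambda=0}_i(\omega) = \rho_i \hat T_i \hat V_\omega - \hat V_\omega(s_i) = \rho_i \left( r_i + \gamma \hat V_\omega(s_{i+1}) \right) - \hat V_\omega(s_i).$$

The $\lambda$-TD error can be expressed as a forward recursion:

Lemma 5 Let $\delta^\lambda_i$ be the $\lambda$-TD error and $\delta_i$ be the standard TD error. Then for all $\omega$,

$$\delta^\lambda_i(\omega) = \delta_i(\omega) + \gamma\lambda\rho_i\, \delta^\lambda_{i+1}(\omega).$$

Proof This is a corollary of Lemma 4:

$$\begin{aligned}
& \hat T^\lambda_i \hat V_\omega = \rho_i r_i + \gamma\rho_i(1-\lambda)\hat V_\omega(s_{i+1}) + \gamma\lambda\rho_i\, \hat T^\lambda_{i+1}\hat V_\omega \\
\Leftrightarrow\ & \hat T^\lambda_i \hat V_\omega - \hat V_\omega(s_i) = \rho_i r_i + \gamma\rho_i \hat V_\omega(s_{i+1}) - \hat V_\omega(s_i) + \gamma\lambda\rho_i \left( \hat T^\lambda_{i+1}\hat V_\omega - \hat V_\omega(s_{i+1}) \right) \\
\Leftrightarrow\ & \delta^\lambda_i(\omega) = \delta_i(\omega) + \gamma\lambda\rho_i\, \delta^\lambda_{i+1}(\omega). \qquad \blacksquare
\end{aligned}$$

Therefore, we get the following update rule

$$\theta_i = \theta_{i-1} + \alpha_i \phi_i\, \delta^\lambda_i(\theta_{i-1})$$

with $\delta^\lambda_i(\theta_{i-1}) = \delta_i(\theta_{i-1}) + \gamma\lambda\rho_i\, \delta^\lambda_{i+1}(\theta_{i-1})$. The key idea here is to find some backward
recursion such that in expectation, when the Markov chain has reached its steady state, it
provides the same result as the forward recursion. Such a backward recursion is given by
the following proposition.

Proposition 6 Let $z_i$ be the eligibility vector, defined by the following recursion:

$$z_i = \phi_i + \gamma\lambda\rho_{i-1} z_{i-1}.$$

For all $\omega$, we have

$$E_{\mu_0}[\phi_i\, \delta^\lambda_i(\omega)] = E_{\mu_0}[z_i\, \delta_i(\omega)].$$

Proof For clarity, we omit the dependence with respect to $\omega$ and write below $\delta_i$ (resp. $\delta^\lambda_i$)
for $\delta_i(\omega)$ (resp. $\delta^\lambda_i(\omega)$). The result relies on successive applications of Lemma 5. We have:

$$E_{\mu_0}[\phi_i \delta^\lambda_i] = E_{\mu_0}[\phi_i(\delta_i + \gamma\lambda\rho_i \delta^\lambda_{i+1})] = E_{\mu_0}[\phi_i \delta_i] + E_{\mu_0}[\phi_i \gamma\lambda\rho_i \delta^\lambda_{i+1}].$$

Moreover, we have that $E_{\mu_0}[\phi_i \rho_i \delta^\lambda_{i+1}] = E_{\mu_0}[\phi_{i-1}\rho_{i-1}\delta^\lambda_i]$, as expectation is done according
to the stationary distribution, therefore:

$$\begin{aligned}
E_{\mu_0}[\phi_i \delta^\lambda_i] &= E_{\mu_0}[\phi_i \delta_i] + \gamma\lambda E_{\mu_0}[\phi_{i-1}\rho_{i-1}\delta^\lambda_i] \\
&= E_{\mu_0}[\phi_i \delta_i] + \gamma\lambda E_{\mu_0}[\phi_{i-1}\rho_{i-1}(\delta_i + \gamma\lambda\rho_i \delta^\lambda_{i+1})] \\
&= \dots \\
&= E_{\mu_0}[\delta_i(\phi_i + \gamma\lambda\rho_{i-1}\phi_{i-1} + (\gamma\lambda)^2\rho_{i-1}\rho_{i-2}\phi_{i-2} + \cdots)] \\
&= E_{\mu_0}[\delta_i z_i]. \qquad \blacksquare
\end{aligned}$$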



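This suggests replacing Equation (17) by the following update rule,

$$\theta_i = \theta_{i-1} + \alpha_i z_i\, \delta_i(\theta_{i-1}),$$

which is equivalent in expectation when the Markov chain has reached its steady state. This
is summarized in Algorithm 5, where $\rho_i r_i - \Delta\phi_i^T\theta_{i-1} = \delta_i(\theta_{i-1})$ with $\Delta\phi_i = \phi_i - \gamma\rho_i\phi_{i+1}$.

Algorithm 5: Off-policy TD($\lambda$)

Initialization;
    Initialize vector $\theta_0$;
    Set $z_0 = 0$;
for $i = 1, 2, \dots$ do
    Observe $\phi_i, r_i, \phi_{i+1}$;
    Update traces;
        $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$;
    Update parameters;
        $\theta_i = \theta_{i-1} + \alpha_i z_i (\rho_i r_i - \Delta\phi_i^T\theta_{i-1})$;
end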
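The loop of Algorithm 5 translates almost line by line into code. The following is a minimal sketch in plain Python, not a reference implementation: the transition tuple layout, the function names and the callable learning rate are assumptions made for this example.

```python
def dot(u, v):
    """Inner product of two equally sized lists."""
    return sum(a * b for a, b in zip(u, v))


def off_policy_td_lambda(transitions, p, gamma, lam, alpha):
    """Run the off-policy TD(lambda) updates over a list of transitions.

    Each transition is (phi, r, phi_next, rho): feature vectors as lists of
    length p, the reward, and the importance weight of the action taken.
    alpha is a callable i -> learning rate (hypothetical interface).
    """
    theta = [0.0] * p
    z = [0.0] * p
    rho_prev = 1.0  # irrelevant at the first step because z_0 = 0
    for i, (phi, r, phi_next, rho) in enumerate(transitions, start=1):
        # trace update: z_i = gamma * lam * rho_{i-1} * z_{i-1} + phi_i
        z = [gamma * lam * rho_prev * zk + fk for zk, fk in zip(z, phi)]
        # off-policy TD error, evaluated at theta_{i-1}
        delta = rho * (r + gamma * dot(phi_next, theta)) - dot(phi, theta)
        theta = [tk + alpha(i) * zk * delta for tk, zk in zip(theta, z)]
        rho_prev = rho
    return theta
```

Each step costs $O(p)$ operations and only the trace and the parameter vector are stored, which is the point of the backward view.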



This algorithm was first proposed in the tabular case by Precup et al. (2000). An
off-policy TD($\lambda$) algorithm (with function approximation) was proposed by Precup
et al. (2001), but it differs significantly from the algorithm just described (notably
in the definition of the traces and the projected Bellman equation, and in the fact that it
is constrained to episodic trajectories). Algorithm 5 was actually first proposed much
more recently by Bertsekas and Yu (2009a).



An important issue for the analysis of this algorithm is the fact that the trace $z_i$ may
have an infinite variance, due to importance sampling (see Yu (2010b, Sec. 3.1)). As far as
we know, the only existing analysis of off-policy TD($\lambda$) (as provided in Algorithm 5) uses
an additional constraint which forces the parameters to be bounded: after each parameter
update, the resulting parameter vector is projected onto some predefined compact set. This
analysis is performed by Yu (2010b, Sec. 4.1). Under the standard assumptions of stochastic
approximations and most of the assumptions required for the on-policy TD($\lambda$) algorithm,
assuming moreover that $\Pi_0 T^\lambda$ is a contraction (which we recall to hold for a big enough
$\lambda$) and that the predefined compact set used to project the parameter vector is a large
enough ball containing the fixed point of $\Pi_0 T^\lambda$, the constrained version of off-policy TD($\lambda$)
converges to this fixed point (therefore, the same solution as off-policy LSTD($\lambda$), LSPE($\lambda$)
and FPKF($\lambda$)). We refer to Yu (2010b, Sec. 4.1) for further details. An analysis of the
unconstrained version of off-policy TD($\lambda$) described in Algorithm 5 is an interesting topic
for future research.
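The projection step used by the constrained variant can be sketched as follows, assuming the compact set is a Euclidean ball centered at the origin. The function name and the choice of center are assumptions of this illustration; the cited analysis only requires a large enough ball containing the fixed point.

```python
import math


def project_onto_ball(theta, radius):
    """Euclidean projection of theta onto the ball of the given radius."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm <= radius:
        return list(theta)  # already inside the ball: nothing to do
    scale = radius / norm
    return [scale * t for t in theta]
```

In the constrained algorithm this function would be applied to the parameter vector right after each update.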



4.2 Off-policy TDC($\lambda$) and off-policy GTD2($\lambda$)

In this section, the case $\xi = \theta_i$ is considered. Following the general pattern, at step $i$,
we would like to come up with a new parameter $\theta_i$ that moves (from $\theta_{i-1}$) closer to the
minimum of the function

$$\omega \mapsto J(\omega, \theta_i) = \sum_{j=1}^{i} \left( \hat T^\lambda_j \hat V_{\theta_i} - \hat V_\omega(s_j) \right)^2.$$

This problem is tricky since the function to minimize contains what we want to compute
($\theta_i$) as a parameter. For this reason we cannot directly perform a stochastic gradient
descent of the right-hand side. Instead, we will consider an alternative (but equivalent)
formulation of the projected fixed-point minimization $\theta = \arg\min_\omega \|\hat V_\omega - \Pi_0 T^\lambda \hat V_\theta\|^2$, and
will move from $\theta_{i-1}$ to $\theta_i$ by making one step of gradient descent of an estimate of the
function

$$\theta \mapsto \|\hat V_\theta - \Pi_0 T^\lambda \hat V_\theta\|^2.$$

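With the following vectorial notations:

$$\hat{\mathbf V}_\omega = \left( \hat V_\omega(s_1)\ \dots\ \hat V_\omega(s_i) \right)^T, \qquad \hat{\mathbf T}^\lambda \hat{\mathbf V}_\omega = \left( \hat T^\lambda_1 \hat V_\omega\ \dots\ \hat T^\lambda_i \hat V_\omega \right)^T,$$

$$\hat\Phi = \left[ \phi(s_1)\ \dots\ \phi(s_i) \right]^T, \qquad \hat\Pi_0 = \hat\Phi (\hat\Phi^T \hat\Phi)^{-1} \hat\Phi^T,$$

we consider the following objective function:

$$\begin{aligned}
J(\theta) &= \left\| \hat{\mathbf V}_\theta - \hat\Pi_0 \hat{\mathbf T}^\lambda \hat{\mathbf V}_\theta \right\|^2 \\
&= \left( \hat{\mathbf V}_\theta - \hat{\mathbf T}^\lambda \hat{\mathbf V}_\theta \right)^T \hat\Pi_0 \left( \hat{\mathbf V}_\theta - \hat{\mathbf T}^\lambda \hat{\mathbf V}_\theta \right) \\
&= \left( \sum_{j=1}^{i} \delta^\lambda_j(\theta)\phi_j \right)^T \left( \sum_{j=1}^{i} \phi_j\phi_j^T \right)^{-1} \left( \sum_{j=1}^{i} \delta^\lambda_j(\theta)\phi_j \right).
\end{aligned}$$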

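This is the derivation followed by Sutton et al. (2009) in the case $\lambda = 0$ and by Maei and
Sutton (2010) in the case $\lambda > 0$ (and off-policy learning). Let us introduce the following
notation:

$$g^\lambda_j = \nabla \hat T^\lambda_j \hat V_\theta. \tag{18}$$

Note that since we consider a linear approximation this quantity does not depend on $\theta$.
Noticing that $\nabla\delta^\lambda_j(\theta) = g^\lambda_j - \phi_j$, we can compute $\nabla J(\theta)$:

$$\begin{aligned}
-\frac{1}{2}\nabla J(\theta) &= -\left( \sum_{j=1}^{i} \nabla\delta^\lambda_j(\theta)\phi_j^T \right) \left( \sum_{j=1}^{i} \phi_j\phi_j^T \right)^{-1} \left( \sum_{j=1}^{i} \delta^\lambda_j(\theta)\phi_j \right) \\
&= \left( \sum_{j=1}^{i} (\phi_j - g^\lambda_j)\phi_j^T \right) \left( \sum_{j=1}^{i} \phi_j\phi_j^T \right)^{-1} \left( \sum_{j=1}^{i} \delta^\lambda_j(\theta)\phi_j \right) \\
&= \sum_{j=1}^{i} \delta^\lambda_j(\theta)\phi_j - \left( \sum_{j=1}^{i} g^\lambda_j\phi_j^T \right) \left( \sum_{j=1}^{i} \phi_j\phi_j^T \right)^{-1} \left( \sum_{j=1}^{i} \delta^\lambda_j(\theta)\phi_j \right).
\end{aligned} \tag{19}$$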

Let $w_i(\theta)$ be a quasi-stationary estimate of the last part, that can be recognized as the
solution of a least-squares problem (regression of $\lambda$-TD errors $\delta^\lambda_j$ on features $\phi_j$):

$$w_i(\theta) = \left( \sum_{j=1}^{i} \phi_j\phi_j^T \right)^{-1} \sum_{j=1}^{i} \delta^\lambda_j(\theta)\phi_j = \arg\min_\omega \sum_{j=1}^{i} \left( \phi_j^T\omega - \delta^\lambda_j(\theta) \right)^2.$$

The identification with the above least-squares solution suggests using the following stochastic
gradient descent to form the quasi-stationary estimate:

$$w_i = w_{i-1} + \beta_i \phi_i \left( \delta^\lambda_i(\theta_{i-1}) - \phi_i^T w_{i-1} \right).$$

This update rule makes use of the $\lambda$-TD error, defined through a forward view. As for the
previous algorithm, we can use Proposition 6 to obtain the following backward view update
rule that is equivalent (in expectation when the Markov chain reaches its steady state):

$$w_i = w_{i-1} + \beta_i \left( z_i\, \delta_i(\theta_{i-1}) - \phi_i(\phi_i^T w_{i-1}) \right). \tag{20}$$

Using this quasi-stationary estimate, the gradient can be approximated as:

$$-\frac{1}{2}\nabla J(\theta) \approx \sum_{j=1}^{i} \delta^\lambda_j(\theta)\phi_j - \left( \sum_{j=1}^{i} g^\lambda_j\phi_j^T \right) w_i.$$

Therefore, a stochastic gradient descent gives the following update rule for the parameter
vector $\theta$:

$$\theta_i = \theta_{i-1} + \alpha_i \left( \delta^\lambda_i(\theta_{i-1})\phi_i - g^\lambda_i \phi_i^T w_{i-1} \right). \tag{21}$$

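Once again, the forward view term $\delta^\lambda_i(\theta_{i-1})\phi_i$ can be turned into a backward view by using
Proposition 6. It remains to work on the term $g^\lambda_i\phi_i^T$.

First, one can notice that the term $g^\lambda_i$ satisfies a forward recursion.

Lemma 7 We have

$$g^\lambda_i = \gamma\rho_i(1-\lambda)\phi_{i+1} + \gamma\lambda\rho_i\, g^\lambda_{i+1}.$$

Proof This result is simply obtained by applying the gradient (with respect to $\theta$) to the
forward recursion of $\hat T^\lambda_i \hat V_\theta$ provided in Lemma 4. $\blacksquare$

Using this, the term $g^\lambda_i\phi_i^T$ can be worked out similarly to the term $\delta^\lambda_i(\theta_{i-1})\phi_i$.

Proposition 8 Let $z_i$ be the eligibility vector defined in Proposition 6. We have

$$E_{\mu_0}[g^\lambda_i\phi_i^T] = E_{\mu_0}[\gamma\rho_i(1-\lambda)\phi_{i+1} z_i^T].$$

Proof The proof is similar to that of Proposition 6. Writing $b_i = \gamma\rho_i(1-\lambda)\phi_{i+1}$ and
$\eta_i = \gamma\lambda\rho_i$, we have

$$\begin{aligned}
E_{\mu_0}[g^\lambda_i\phi_i^T] &= E_{\mu_0}[(b_i + \eta_i g^\lambda_{i+1})\phi_i^T] \\
&= E_{\mu_0}[b_i\phi_i^T] + E_{\mu_0}[\eta_{i-1}(b_i + \eta_i g^\lambda_{i+1})\phi_{i-1}^T] \\
&= \dots \\
&= E_{\mu_0}[b_i z_i^T]. \qquad \blacksquare
\end{aligned}$$

Using this result and Proposition 6, it is natural to replace Equation (21) by an update
based on a backward recursion:

$$\theta_i = \theta_{i-1} + \alpha_i \left( z_i\, \delta_i(\theta_{i-1}) - \gamma\rho_i(1-\lambda)\phi_{i+1}(z_i^T w_{i-1}) \right). \tag{22}$$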



Last but not least, for the estimate $w_i$ to be indeed quasi-stationary, the learning rates
should satisfy the following condition (in addition to the classical conditions):

$$\lim_{i\to\infty} \frac{\alpha_i}{\beta_i} = 0.$$
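To make the two-timescale condition concrete, here is a small sketch with one possible pair of schedules for which the ratio $\alpha_i/\beta_i$ vanishes. The schedule shapes and the default constants are assumptions chosen for illustration, not a prescription.

```python
def alpha_schedule(i, alpha0=1.0, alpha_c=10.0):
    """Hypothetical slow rate, decaying like 1 / i."""
    return alpha0 * alpha_c / (alpha_c + i)


def beta_schedule(i, beta0=1.0, beta_c=10.0):
    """Hypothetical fast rate, decaying like 1 / i^(2/3)."""
    return beta0 * beta_c / (beta_c + i ** (2.0 / 3.0))


def rate_ratio(i):
    """alpha_i / beta_i, which should tend to 0 as i grows."""
    return alpha_schedule(i) / beta_schedule(i)
```

With these choices the ratio behaves like $i^{-1/3}$ for large $i$, so the auxiliary vector $w_i$ is updated on a faster timescale than $\theta_i$.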



Equations (22) and (20) define the off-policy TDC($\lambda$) algorithm, summarized in Algorithm 6.
It was originally proposed by Maei and Sutton (2010) under the name GQ($\lambda$). We call
it off-policy TDC($\lambda$) to highlight the fact that it is the extension of the original TDC
algorithm of Sutton et al. (2009) to off-policy learning with traces. One can observe (to
our knowledge, this was never mentioned in the literature before) that when $\lambda = 1$, the
learning rule of TDC(1) reduces to that of TD(1).

Maei and Sutton (2010) show that the algorithm converges with probability 1 to the
same solution as the LSTD($\lambda$) algorithm (that is, to $\theta^* = A^{-1}b$) under some technical
assumptions. Contrary to off-policy TD($\lambda$), this algorithm does not require $\Pi_0 T^\lambda$ to be
a contraction in order to be convergent. Unfortunately, one of the assumptions made
in the analysis, requiring that the traces $z_i$ have uniformly bounded second moments, is
restrictive since in an off-policy setting the traces $z_i$ may easily have an infinite variance
(unless the behavior policy is really close to the target policy), as noted by Yu (2010a) (see
also Randhawa and Juneja (2004)). A full proof of convergence thus still remains to be
done.



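Algorithm 6: Off-policy TDC($\lambda$), also known as GQ($\lambda$)

Initialization;
    Initialize vectors $\theta_0$ and $w_0$;
    Set $z_0 = 0$;
for $i = 1, 2, \dots$ do
    Observe $\phi_i, r_i, \phi_{i+1}$;
    Update traces;
        $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$;
    Update parameters;
        $\theta_i = \theta_{i-1} + \alpha_i \left( z_i(\rho_i r_i - \Delta\phi_i^T\theta_{i-1}) - \gamma\rho_i(1-\lambda)\phi_{i+1}(z_i^T w_{i-1}) \right)$;
        $w_i = w_{i-1} + \beta_i \left( z_i(\rho_i r_i - \Delta\phi_i^T\theta_{i-1}) - \phi_i(\phi_i^T w_{i-1}) \right)$;
end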
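One iteration of Algorithm 6 can be sketched in plain Python as below. The function signature is an assumption of this example, and both updates use the TD error evaluated at $\theta_{i-1}$, in agreement with Equations (20) and (22).

```python
def dot(u, v):
    """Inner product of two equally sized lists."""
    return sum(a * b for a, b in zip(u, v))


def tdc_lambda_step(theta, w, z, rho_prev, phi, r, phi_next, rho,
                    gamma, lam, alpha, beta):
    """One off-policy TDC(lambda) iteration; returns (theta, w, z)."""
    # trace update: z_i = gamma * lam * rho_{i-1} * z_{i-1} + phi_i
    z = [gamma * lam * rho_prev * zk + fk for zk, fk in zip(z, phi)]
    # off-policy TD error at theta_{i-1}
    delta = rho * (r + gamma * dot(phi_next, theta)) - dot(phi, theta)
    zw = dot(z, w)    # z_i^T w_{i-1}
    fw = dot(phi, w)  # phi_i^T w_{i-1}
    correction = gamma * rho * (1.0 - lam) * zw
    new_theta = [tk + alpha * (zk * delta - correction * fn)
                 for tk, zk, fn in zip(theta, z, phi_next)]
    new_w = [wk + beta * (zk * delta - fk * fw)
             for wk, zk, fk in zip(w, z, phi)]
    return new_theta, new_w, z
```

Setting `lam = 1.0` makes the correction term vanish, so the update of `theta` coincides with that of TD(1), as observed in the text.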



Using the same principle (that is, performing a stochastic gradient descent to minimize
$J(\theta)$), an alternative to TDC, the GTD2 algorithm, was derived by Sutton
et al. (2009) in the $\lambda = 0$ case. As far as we know, it has never been extended to off-policy learning with
traces; we do it now. Notice that, given the derivation of GQ($\lambda$), obtaining this algorithm
is pretty straightforward.

To do so, we can start back from Equation (19):

$$-\frac{1}{2}\nabla J(\theta) = \left( \sum_{j=1}^{i} (\phi_j - g^\lambda_j)\phi_j^T \right) \left( \sum_{j=1}^{i} \phi_j\phi_j^T \right)^{-1} \left( \sum_{j=1}^{i} \delta^\lambda_j(\theta)\phi_j \right) \approx \left( \sum_{j=1}^{i} (\phi_j - g^\lambda_j)\phi_j^T \right) w_i.$$

This suggests the following alternative update rule (based on forward recursion):

$$\theta_i = \theta_{i-1} + \alpha_i (\phi_i - g^\lambda_i)\phi_i^T w_{i-1}.$$

Using Proposition 8, it is natural to use the following alternative update rule, based on a
backward recursion:

$$\theta_i = \theta_{i-1} + \alpha_i \left( \phi_i(\phi_i^T w_{i-1}) - \gamma\rho_i(1-\lambda)\phi_{i+1}(z_i^T w_{i-1}) \right).$$

The update of $w_i$ remains the same, and put together it gives off-policy GTD2($\lambda$), summarized
in Algorithm 7. The analysis of this new algorithm constitutes a potential topic for
future research.

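Algorithm 7: Off-policy GTD2($\lambda$)

Initialization;
    Initialize vectors $\theta_0$ and $w_0$;
    Set $z_0 = 0$;
for $i = 1, 2, \dots$ do
    Observe $\phi_i, r_i, \phi_{i+1}$;
    Update traces;
        $z_i = \gamma\lambda\rho_{i-1} z_{i-1} + \phi_i$;
    Update parameters;
        $\theta_i = \theta_{i-1} + \alpha_i \left( \phi_i(\phi_i^T w_{i-1}) - \gamma\rho_i(1-\lambda)\phi_{i+1}(z_i^T w_{i-1}) \right)$;
        $w_i = w_{i-1} + \beta_i \left( z_i(\rho_i r_i - \Delta\phi_i^T\theta_{i-1}) - \phi_i(\phi_i^T w_{i-1}) \right)$;
end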
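The only difference with TDC($\lambda$) lies in the first term of the update of $\theta$, as the following sketch of one iteration shows. The function signature is an assumption of this example, and the auxiliary update evaluates the TD error at $\theta_{i-1}$, as in Equation (20).

```python
def dot(u, v):
    """Inner product of two equally sized lists."""
    return sum(a * b for a, b in zip(u, v))


def gtd2_lambda_step(theta, w, z, rho_prev, phi, r, phi_next, rho,
                     gamma, lam, alpha, beta):
    """One off-policy GTD2(lambda) iteration; returns (theta, w, z)."""
    z = [gamma * lam * rho_prev * zk + fk for zk, fk in zip(z, phi)]
    delta = rho * (r + gamma * dot(phi_next, theta)) - dot(phi, theta)
    zw = dot(z, w)    # z_i^T w_{i-1}
    fw = dot(phi, w)  # phi_i^T w_{i-1}
    correction = gamma * rho * (1.0 - lam) * zw
    # theta moves along phi_i (phi_i^T w) instead of z_i * delta as in TDC
    new_theta = [tk + alpha * (fk * fw - correction * fn)
                 for tk, fk, fn in zip(theta, phi, phi_next)]
    new_w = [wk + beta * (zk * delta - fk * fw)
             for wk, zk, fk in zip(w, z, phi)]
    return new_theta, new_w, z
```

Note that the TD error enters the update of `theta` only through the auxiliary vector `w`.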



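4.3 Off-policy gBRM($\lambda$)

The last considered approach is the residual approach, corresponding to the instantiation
$\xi = \omega$. The cost function to be minimized is then:

$$J(\omega) = \sum_{j=1}^{i} \left( \hat T^\lambda_j \hat V_\omega - \hat V_\omega(s_j) \right)^2.$$

Following the negative of the gradient of the last term leads to the following update rule:

$$\begin{aligned}
\theta_i &= \theta_{i-1} - \frac{\alpha_i}{2} \nabla_\omega \left( \hat T^\lambda_i \hat V_\omega - \hat V_\omega(s_i) \right)^2 \Big|_{\omega = \theta_{i-1}} \\
&= \theta_{i-1} - \alpha_i \nabla_\omega \left( \hat T^\lambda_i \hat V_\omega - \hat V_\omega(s_i) \right) \left( \hat T^\lambda_i \hat V_{\theta_{i-1}} - \hat V_{\theta_{i-1}}(s_i) \right) \\
&= \theta_{i-1} + \alpha_i (\phi_i - g^\lambda_i)\, \delta^\lambda_i(\theta_{i-1}),
\end{aligned}$$

recalling the notation $g^\lambda_i = \nabla \hat T^\lambda_i \hat V_\omega$ first defined in Equation (18).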



As usual, this update involves a forward view, which we are going to turn into a backward
view. The term $\phi_i\delta^\lambda_i$ can be worked out thanks to Proposition 6. The term $g^\lambda_i\delta^\lambda_i$ is more
difficult to handle, as it is the product of two forward views (until now, we only considered
the product of a forward view with a non-recursive term). This can be done thanks to the
following original relation (the proof being somewhat tedious, it is deferred to Appendix C):

Proposition 9 Write $g^\lambda_i = \nabla \hat T^\lambda_i \hat V_\omega$ and define

$$\begin{aligned}
c_i &= 1 + (\gamma\lambda\rho_{i-1})^2 c_{i-1}, \\
\zeta_i &= \gamma\rho_i(1-\lambda)\phi_{i+1} c_i + \gamma\lambda\rho_{i-1}\zeta_{i-1}, \\
d_i &= \delta_i c_i + \gamma\lambda\rho_{i-1} d_{i-1};
\end{aligned}$$

we have that

$$E_{\mu_0}[\delta^\lambda_i g^\lambda_i] = E_{\mu_0}\left[ \delta_i\zeta_i + d_i\gamma\rho_i(1-\lambda)\phi_{i+1} - \delta_i\gamma\rho_i(1-\lambda)\phi_{i+1} c_i \right].$$

This result (together with Proposition 6) suggests updating the parameters as follows:

$$\theta_i = \theta_{i-1} + \alpha_i \left( \delta_i \left( z_i + \gamma\rho_i(1-\lambda)\phi_{i+1} c_i - \zeta_i \right) - d_i\gamma\rho_i(1-\lambda)\phi_{i+1} \right).$$

This gives the off-policy gBRM($\lambda$) algorithm, depicted in Algorithm 8. One can observe
that gBRM(1) is equivalent to TD(1) (and thus also TDC(1), cf. the comment before the
description of Algorithm 6). The analysis of this new algorithm is left for future research.

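Algorithm 8: Off-policy gBRM($\lambda$)

Initialization;
    Initialize vector $\theta_0$;
    Set $z_0 = 0$, $d_0 = 0$, $c_0 = 0$, $\zeta_0 = 0$;
for $i = 1, 2, \dots$ do
    Observe $\phi_i, r_i, \phi_{i+1}$;
    Update traces;
        $z_i = \phi_i + \gamma\lambda\rho_{i-1} z_{i-1}$;
        $c_i = 1 + (\gamma\lambda\rho_{i-1})^2 c_{i-1}$;
        $\zeta_i = \gamma\rho_i(1-\lambda)\phi_{i+1} c_i + \gamma\lambda\rho_{i-1}\zeta_{i-1}$;
        $d_i = (\rho_i r_i - \Delta\phi_i^T\theta_{i-1}) c_i + \gamma\lambda\rho_{i-1} d_{i-1}$;
    Update parameters;
        $\theta_i = \theta_{i-1} + \alpha_i \left( (\rho_i r_i - \Delta\phi_i^T\theta_{i-1})(z_i + \gamma\rho_i(1-\lambda)\phi_{i+1} c_i - \zeta_i) - d_i\gamma\rho_i(1-\lambda)\phi_{i+1} \right)$;
end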
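A possible transcription of one iteration of Algorithm 8 is sketched below. Packing the traces into a tuple and the function signature are assumptions of this example.

```python
def dot(u, v):
    """Inner product of two equally sized lists."""
    return sum(a * b for a, b in zip(u, v))


def gbrm_lambda_step(theta, state, phi, r, phi_next, rho, gamma, lam, alpha):
    """One off-policy gBRM(lambda) iteration.

    state = (z, c, zeta, d, rho_prev) gathers the traces and the previous
    importance weight (hypothetical packing). Returns (theta, state).
    """
    z, c, zeta, d, rho_prev = state
    decay = gamma * lam * rho_prev
    b = [gamma * rho * (1.0 - lam) * fn for fn in phi_next]  # gamma rho_i (1 - lam) phi_{i+1}
    delta = rho * (r + gamma * dot(phi_next, theta)) - dot(phi, theta)
    z = [fk + decay * zk for fk, zk in zip(phi, z)]
    c = 1.0 + decay ** 2 * c
    zeta = [bk * c + decay * zk for bk, zk in zip(b, zeta)]
    d = delta * c + decay * d
    new_theta = [tk + alpha * (delta * (zk + bk * c - zetak) - d * bk)
                 for tk, zk, bk, zetak in zip(theta, z, b, zeta)]
    return new_theta, (z, c, zeta, d, rho)
```

Two sanity checks follow from the formulas: with `lam = 0.0` the update reduces to the residual-gradient step $\alpha_i\delta_i(\phi_i - \gamma\rho_i\phi_{i+1})$, and with `lam = 1.0` it reduces to the TD(1) step $\alpha_i\delta_i z_i$.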



5. Empirical Study 

This section aims at empirically comparing the surveyed algorithms. As they only address
the policy evaluation problem, we compare the algorithms in their ability to perform policy
evaluation (no control, no policy optimization); however, they may straightforwardly be
used in an approximate policy iteration approach (Bertsekas and Tsitsiklis, 1996; Munos,
2003). In order to assess their quality, we consider finite problems where the exact value
function can be computed.



More precisely, we consider Garnet problems (Archibald et al., 1995), which are a class
of randomly constructed finite MDPs. They do not correspond to any specific application,
but are totally abstract while remaining representative of the kind of MDP that might be
encountered in practice. In our experiments, a Garnet is parameterized by 4 parameters
and is written $\mathcal{G}(n_S, n_A, b, p)$: $n_S$ is the number of states, $n_A$ is the number of actions,
$b$ is a branching factor specifying how many next states are possible for each
state-action pair ($b$ states are chosen uniformly at random and transition probabilities are
set by sampling uniform random $b - 1$ cut points between 0 and 1) and $p$ is the number of
features (for function approximation). The reward is state-dependent: for a given randomly
generated Garnet problem, the reward for each state is uniformly sampled between 0 and
1. Features are chosen randomly: $\Phi$ is an $n_S \times p$ feature matrix of which each component
is randomly and uniformly sampled between 0 and 1, except the first column, of which each
component is set to 1 (this corresponds to a constant feature). The discount factor $\gamma$ is set
to 0.95 in all experiments.
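The construction just described can be sketched as follows. This is only an illustrative generator: the function names, the dictionary representation of transition rows and the placement of the constant feature in the first column are assumptions of this example.

```python
import random


def garnet_transitions(n_states, n_actions, branching, rng):
    """Return P[s][a] as a dict {next_state: probability}.

    For each state-action pair, `branching` distinct next states are drawn
    uniformly and their probabilities come from `branching - 1` uniform cut
    points in [0, 1].
    """
    P = []
    for _ in range(n_states):
        row = []
        for _ in range(n_actions):
            successors = rng.sample(range(n_states), branching)
            cuts = sorted(rng.random() for _ in range(branching - 1))
            bounds = [0.0] + cuts + [1.0]
            probs = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
            row.append(dict(zip(successors, probs)))
        P.append(row)
    return P


def garnet_rewards_and_features(n_states, p, rng):
    """State rewards in [0, 1] and an n_states x p feature matrix.

    The first component of every feature vector is set to 1 (constant
    feature, assumed here to be the first column); the others are uniform.
    """
    rewards = [rng.random() for _ in range(n_states)]
    features = [[1.0] + [rng.random() for _ in range(p - 1)]
                for _ in range(n_states)]
    return rewards, features
```

For instance, `garnet_transitions(30, 4, 2, random.Random(0))` would give the dynamics of a "small" instance; passing an explicit `random.Random` object keeps the generation reproducible.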

We consider two types of problems, "small" and "big", respectively corresponding to
instances $\mathcal{G}(30,4,2,8)$ and $\mathcal{G}(100,10,3,20)$. We also consider two types of learning:
on-policy learning and off-policy learning. In the on-policy setting, for each Garnet a policy $\pi$
to be evaluated is randomly generated (by sampling randomly $n_A - 1$ cut points between 0
and 1 for each state), and trajectories (to be used for learning) are sampled according to this
same policy. In the off-policy setting, the policy $\pi$ to be evaluated is randomly generated
the same way, but trajectories are sampled according to a uniform policy $\pi_0$ (that chooses
each action with equal probability, that is $\pi_0(a|s) = \frac{1}{n_A}$ for any state-action couple).

For all algorithms, we choose $\theta_0 = 0$. For least-squares-based algorithms (LSTD, LSPE,
FPKF and BRM), we set the initial matrices {Mo,No,Co) to 10^/ (the higher this value. 



the more negligible its effect on estimates; see footnote 11). We run a first set of experiments in order to
set all other parameters (eligibility factor and learning rates). We use the following schedule
for the learning rates:

$$\alpha_i = \alpha_0 \frac{\alpha_c}{\alpha_c + i} \quad \text{and} \quad \beta_i = \beta_0 \frac{\beta_c}{\beta_c + i^{2/3}}.$$



More precisely, we generate one problem (MDP and policy) for each possible combination
small/big and on-policy/off-policy (leading to four problems). For each problem, we generate
10 trajectories of length $10^4$ using the behavior policy (which is the randomly generated
target policy in the on-policy case and the uniform policy in the off-policy case), to be used
by all algorithms. For each meta-parameter, we consider the following ranges of values: $\lambda \in
\{0, 0.4, 0.7, 0.9, 1\}$, $\alpha_0 \in \{10^{-2}, 10^{-1}, 10^{0}\}$, $\alpha_c \in \{10^{1}, 10^{2}, 10^{3}\}$, $\beta_0 \in \{10^{-2}, 10^{-1}, 10^{0}\}$ and
$\beta_c \in \{10^{1}, 10^{2}, 10^{3}\}$. Then, we compute the parameter estimates considering all algorithms
instantiated with each possible combination of the meta-parameters. This gives for each
combination a family $\theta_{i,d}$ with $i$ the number of transitions encountered in the $d$-th trajectory.
Finally, for each problem and each algorithm, we choose the combination of meta-parameters
which minimizes the average error on the second half of the learning curves (we do this to
reduce the sensitivity to the initialization and the transient behavior). Formally, we pick
the set of parameters that minimizes the following quantity:

$$\text{err} = \frac{1}{10} \sum_{d=1}^{10} \frac{1}{5 \cdot 10^3} \sum_{i=5\cdot 10^3}^{10^4} \left\| \Phi\theta_{i,d} - V^\pi \right\|_2.$$
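The selection criterion amounts to averaging the error over the second half of every learning curve. A minimal sketch, assuming the per-step errors have already been computed and stored as one list per trajectory (the function name and this data layout are hypothetical):

```python
def second_half_error(errors_per_trajectory):
    """Average error over the second half of each learning curve.

    errors_per_trajectory[d][i] is assumed to hold the error of the
    estimate after transition i of trajectory d.
    """
    total = 0.0
    count = 0
    for errors in errors_per_trajectory:
        tail = errors[len(errors) // 2:]  # discard the transient first half
        total += sum(tail)
        count += len(tail)
    return total / count
```

The meta-parameter combination with the smallest returned value would then be retained for the corresponding algorithm and problem.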



11. We observed that this parameter did not play a crucial role in practice.

We provide the empirical results of this first set of experiments in Tables 3 to 6. As a
complement, we detail in Figure 1 the sensitivity of all algorithms with respect to the main
parameter $\lambda$ that controls the eligibility traces. We comment on these results below.

Table 3: Small problem ($\mathcal{G}(30,4,2,8)$), on-policy learning ($\pi = \pi_0$).

Table 3 shows the best meta-parameters for 10 trajectories of a single instance of a
small Garnet problem in an on-policy setting, as well as the related performance. Numerically,
all least-squares-based methods provide equivalent performance, with similar choices of the
eligibility factor (which is the only meta-parameter). TD gets its best results with a small
$\lambda$ (we believe this is the case because the MDP has few states). The new gBRM and
GTD2 algorithms work well, but pick a higher value of $\lambda$. Finally, TDC performs
slightly worse. Figure 1 (top, left) shows that the choice of $\lambda$ does not matter, except for
FPKF, which requires $\lambda = 1$ to be efficient; with this value FPKF is almost identical to
LSPE(1) and LSTD(1) (cf. the discussion at the end of Section 3.3).

Table 4: Big problem ($\mathcal{G}(100,10,3,20)$), on-policy learning ($\pi = \pi_0$).

Table 4 shows the best meta-parameters for 10 trajectories of a single instance of a big
Garnet problem in an on-policy setting, as well as the related performance. These results are
consistent with those of the small problem in the on-policy setting (with slightly different
meta-parameters). The main difference is that GTD2 performs the worst here. Figure 1
(top, right) suggests that as the problem's size grows, the role of the eligibility factor gets
more prominent: most algorithms need a relatively high value of $\lambda$ to perform the best.

Figure 1: Sensitivity of the performance of the algorithms (y-axis, in logarithmic scale) with
respect to the eligibility trace parameter $\lambda$ (x-axis). Left: small problem
($\mathcal{G}(30,4,2,8)$), right: big problem ($\mathcal{G}(100,10,3,20)$). Top: on-policy learning
($\pi = \pi_0$), bottom: off-policy learning ($\pi \neq \pi_0$).



Table 5: Small problem ($\mathcal{G}(30,4,2,8)$), off-policy learning ($\pi \neq \pi_0$).

Table 5 reports the best meta-parameters in an off-policy setting for a small problem. In
terms of performance, least-squares approaches are no longer equivalent. LSTD and LSPE
get the best results, with the smallest possible value $\lambda = 0$: we believe that this choice
is due to the fact that a higher eligibility factor increases the variance due to importance
sampling. FPKF and BRM need large values of $\lambda$ to work well, and suffer much more from
the off-policy aspect. Figure 1 (bottom, left) suggests that FPKF and BRM need a high
value of $\lambda$ (to "catch" the good performance of LSTD/LSPE) but then suffer from the
variance due to importance sampling. Regarding gradient-based methods, TD's performance is
good (it is close to that of LSTD/LSPE), followed closely by GTD2 (both being better than
FPKF/BRM). TDC and gBRM lead to the worst results; as both methods choose $\lambda = 1$,
they here turn out to be equivalent to TD(1) (cf. the discussions accompanying Algorithms 6 and
8; see footnote 12). As for FPKF/BRM with respect to LSTD/LSPE, Figure 1 (bottom, left) further
suggests that TDC and gBRM need a high value of $\lambda$ in order to get a reasonable performance,
but then suffer from the variance of importance sampling.

12. In particular, one can observe that the performance of gBRM(1) and TDC(1) in Table 5 are numerically
equal.

Table 6: Big problem ($\mathcal{G}(100,10,3,20)$), off-policy learning ($\pi \neq \pi_0$).

Finally, Table 6 shows the
meta-parameters and performance in the most difficult situation: the off-policy setting of
the big problem. These results are consistent with the off-policy results of the small
problem, summarized in Table 5: LSTD and LSPE are the most efficient algorithms and choose
the smallest possible value $\lambda = 0$. The performance of FPKF and BRM deteriorates (significantly
for the latter). TD behaves reasonably well (it is in particular much better than FPKF)
and GTD2 follows closely. The performance of TDC and gBRM is the worst, the latter's
by a significant amount. Figure 1 (bottom, right) is similar to that of the small problem.

The main goal of the series of experiments we have just described was to choose
reasonable values for the meta-parameters. We have also used these experiments to briefly
comment on the relative performance of the algorithms, but this is not statistically significant
as it was based on a single (random) problem. Though we will see that the general
behavior of the algorithms is globally consistent with what we have seen so far, the series of
experiments that we are about to describe aims at providing such a statistically significant
performance comparison. For each situation (small and big problems, on- and off-policy),
we fix the meta-parameters to the previously reported values and we compare the
algorithms on several new instances of the problems. These results are reported in Figures 2
to 5. For each of the 4 problems, we randomly generate 100 instances (MDP and policy
to be evaluated). For each such problem, we generate a trajectory of length 10^. Then, all 
algorithms learn using this very trajectory. On each figure, we report the average
performance (left), measured as the difference between the true value function (computed from
the model) and the currently estimated one, $\|V^\pi - \Phi\theta\|_2$, as well as the associated standard
deviation (right).



Figure 2: Performance for small problems ($\mathcal{G}(30,4,2,8)$), on-policy learning ($\pi = \pi_0$); both
panels are titled "on-policy, small problem" (left: average error, right: standard deviation).

Figure 3: Performance for big problems ($\mathcal{G}(100,10,3,20)$), on-policy learning ($\pi = \pi_0$) (left:
average error, right: standard deviation).



We begin by discussing the results in the on-policy setting. Figure 2 compares all
algorithms for 100 randomly generated small problems (that is, each run corresponds to
different dynamics, reward function, features and evaluated policy), the meta-parameters
being those provided in Table 3. All least-squares approaches provide the best results and
are bunched together; this was to be expected, as all algorithms use $\lambda$ close to 1. In these
problems, gBRM works better than other gradient-based methods, followed by TD and
GTD2/TDC. Figure 3 compares the algorithms for 100 randomly generated big problems,
the meta-parameters being those provided in Table 4. These results are similar to those of
the small problems in the on-policy setting, except that TDC is now faster than GTD2 and that
TD is slightly faster than gBRM.

Figure 4: Performance for small problems ($\mathcal{G}(30,4,2,8)$), off-policy learning ($\pi \neq \pi_0$) (left:
average error, right: standard deviation).

Figure 5: Performance for big problems ($\mathcal{G}(100,10,3,20)$), off-policy learning ($\pi \neq \pi_0$) (left:
average error, right: standard deviation).



We now consider the off-policy setting. Figure 4 provides the average performance
and standard deviation of the algorithms (meta-parameters being those of Table 5) on 100
small problems. Once again, we can see that LSTD/LSPE provide the best results. The
two other least-squares methods (FPKF and BRM) are overtaken by the gradient-based
TD algorithm. GTD2 is quite slow (slower than TD) and TDC/gBRM (that are identical
to TD(1) since they both use $\lambda = 1$) are the slowest algorithms. Figure 5 provides the
same data for the big problems (with the meta-parameters of Table 6). These results are
similar to those of the small problems in an off-policy setting. The main differences are 1)
that FPKF appears to converge faster than TD but with a bigger standard deviation, and
2) that gBRM (which uses $\lambda = 0$) does not work anymore (it is here equivalent to the standard
algorithm by Baird (1995) and probably suffers from the well-known associated bias issue).



Summary. Overall, our experiments suggest that the two best algorithms are LSTD/LSPE,
since they converge much faster in all situations. The gradient-based TD algorithm globally
displays a good behavior and constitutes a good alternative when the number $p$ of
features is too big for least-squares methods to be implemented. Though some new
algorithms/extensions show interesting results (FPKF($\lambda$) is consistently better than the
state-of-the-art FPKF by Choi and Van Roy (2006), gBRM works well in the on-policy
setting), most of the other algorithms do not seem to be empirically competitive with the
trio LSTD/LSPE/TD, especially in off-policy situations. In particular, the algorithms
introduced specifically for the off-policy setting (TDC/GTD2) are much slower than TD.
Moreover, the condition required for the good behavior of LSPE, FPKF and TD, namely the
contraction of $\Pi_0 T^\lambda$, does not seem to be very restrictive in practice (at least for the Garnet
problems we considered): though it is possible to build specific pathological examples where
these algorithms diverge,$^{13}$ this never happened in our experiments.



6. Conclusion and Future Work 

We have considered least-squares and gradient-based algorithms for value estimation in an 
MDP context. Starting from the on-policy case with no trace, we have recalled that several 
algorithms (LSTD, LSPE, FPKF and BRM for least-squares approaches, TD, gBRM and 
TDC/GTD2 for gradient-based approaches) fall in a common algorithmic pattern (Equa- 
tion (pi)). Substituting the original Bellman operator by an operator that deals with traces 
and off-policy samples naturally leads to the state-of-the-art off-policy trace-based versions 
of LSTD, LSPE, TD and TDC, and suggests natural extensions of FPKF, BRM, gBRM 
and GTD2. This way, we surveyed many known and new off-policy eligibility traces-based 
algorithms for policy evaluation. 

We have explained how to derive recursive (memory and time-efficient) implementations
of all these algorithms and discussed their known convergence properties (including an
original analysis of BRM($\lambda$) for sufficiently small $\lambda$, which implies the so far not known
convergence of GPTD/KTD). Interestingly, it appears that the analysis of off-policy
traces-based stochastic gradient algorithms under mild assumptions is still an open problem: the
only currently known analysis of TD (Yu 2010a) only applies to a constrained version
of the algorithm, and that of TDC (Maei and Sutton 2010) relies on an assumption on
the boundedness of the second moment of the traces that is restrictive (Yu 2010a). Filling this
theoretical gap, as well as providing complete analyses for the other gradient algorithms
and for FPKF($\lambda$) and BRM($\lambda$), constitutes important future work.



13. A preliminary version of this article (Scherrer and Geist 2011) contains such examples, and also an
example where an adversarial choice of $\lambda$ leads to the divergence of LSTD($\lambda$).

Finally, we have illustrated and compared the behavior of these algorithms; this
constitutes the first exhaustive empirical comparison of linear methods.$^{14}$ Overall, our study
suggests that even if the use of eligibility traces generally improves the efficiency of all
algorithms, LSTD and LSPE consistently provide the best estimates; and in situations where
the computational cost is prohibitive for a least-squares approach (when the number $p$ of
features is large), TD probably constitutes the best alternative.

Appendix A. Derivation of the recursive formulae for BRM($\lambda$)

We here detail the derivation of off-policy BRM($\lambda$). We will need two technical lemmata.
The first one is the Woodbury matrix identity, which generalizes the Sherman-Morrison
formula (given in Lemma 1).

Lemma 10 (Woodbury) Let $A$, $U$, $C$ and $V$ be matrices of correct sizes, then:

$$(A + UCV)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + VA^{-1}U\right)^{-1}VA^{-1}.$$
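The Woodbury identity is easy to check on a small instance. The sketch below is only a sanity check under assumptions of ours: the matrices are arbitrary toy values, and the helpers `matmul`, `add`, `sub` and `inverse` are hypothetical names written here in exact rational arithmetic so that both sides can be compared with `==`.

```python
from fractions import Fraction

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def inverse(M):
    """Gauss-Jordan inverse in exact rational arithmetic."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(int(r == c)) for c in range(n)]
           for r, row in enumerate(M)]
    for c in range(n):
        pivot_row = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

# Toy matrices: A is 3x3, U is 3x2, C is 2x2, V is 2x3.
A = [[2, 0, 0], [0, 3, 0], [0, 0, 4]]
U = [[1, 0], [0, 1], [1, 1]]
C = [[1, 0], [0, 2]]
V = [[1, 2, 0], [0, 1, 1]]

A_inv = inverse(A)
lhs = inverse(add(A, matmul(matmul(U, C), V)))
inner = inverse(add(inverse(C), matmul(matmul(V, A_inv), U)))
rhs = sub(A_inv, matmul(matmul(matmul(A_inv, U), inner), matmul(V, A_inv)))
woodbury_holds = lhs == rhs
```

The practical interest is that the right-hand side only inverts a matrix of the size of $C$ (here 2x2) once $A^{-1}$ is known.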
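The second lemma is a rewriting of nested sums:

Lemma 11 Let $f \in \mathbb{R}^{\mathbb{N}\times\mathbb{N}\times\mathbb{N}}$ and $n \in \mathbb{N}$. We have:

$$\sum_{i=1}^{n}\sum_{j=i}^{n}\sum_{k=i}^{n} f(i,j,k) = \sum_{i=1}^{n}\sum_{j=1}^{i}\sum_{k=1}^{j} f(k,i,j) + \sum_{i=2}^{n}\sum_{j=1}^{i-1}\sum_{k=1}^{j} f(k,j,i).$$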
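The rearrangement of nested sums can be verified by brute force. The sketch below assumes one consistent way of writing it: the triple sum over $i \le j$, $i \le k$ is split according to whether the second argument of $f$ is at least as large as the third one or strictly smaller, and the largest index becomes the outer summation variable. The test function `f` is an arbitrary non-symmetric choice.

```python
def lhs(f, n):
    """Sum of f(i, j, k) over 1 <= i <= n, i <= j <= n, i <= k <= n."""
    return sum(f(i, j, k)
               for i in range(1, n + 1)
               for j in range(i, n + 1)
               for k in range(i, n + 1))

def rhs(f, n):
    """Same sum, reorganized so that the largest index is the outer one."""
    first = sum(f(k, i, j)           # second argument >= third argument
                for i in range(1, n + 1)
                for j in range(1, i + 1)
                for k in range(1, j + 1))
    second = sum(f(k, j, i)          # second argument < third argument
                 for i in range(2, n + 1)
                 for j in range(1, i)
                 for k in range(1, j + 1))
    return first + second

def f(i, j, k):
    # Arbitrary integer-valued function that distinguishes its three arguments.
    return i + 10 * j + 100 * k * k

checks = [lhs(f, n) == rhs(f, n) for n in range(1, 8)]
```

Putting the largest index outermost is what later allows the inner sums to be accumulated recursively as new samples arrive.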



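As stated in Equation (14), we have the following batch estimate for BRM($\lambda$):

$$\theta_i = \operatorname*{argmin}_{\omega\in\mathbb{R}^p}\sum_{j=1}^{i}\left(z_{j\to i} - \psi_{j\to i}^T\omega\right)^2 = (\tilde A_i)^{-1}\tilde b_i$$

where

$$\psi_{j\to i} = \sum_{k=j}^{i}\tilde\rho_j^{k-1}\Delta\phi_k \quad\text{and}\quad z_{j\to i} = \sum_{k=j}^{i}\tilde\rho_j^{k-1}\rho_k r_k$$

and

$$\tilde A_i = \sum_{j=1}^{i}\psi_{j\to i}\psi_{j\to i}^T \quad\text{and}\quad \tilde b_i = \sum_{j=1}^{i}\psi_{j\to i}z_{j\to i}.$$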



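To obtain a recursive formula, these two sums have to be reworked through Lemma 11.
Let us first focus on the latter:

$$\begin{aligned}
\sum_{j=1}^{i}\psi_{j\to i}z_{j\to i} &= \sum_{j=1}^{i}\sum_{k=j}^{i}\sum_{m=j}^{i}\tilde\rho_j^{k-1}\Delta\phi_k\,\tilde\rho_j^{m-1}\rho_m r_m\\
&= \sum_{j=1}^{i}\sum_{k=1}^{j}\sum_{m=1}^{k}\tilde\rho_m^{j-1}\Delta\phi_j\,\tilde\rho_m^{k-1}\rho_k r_k + \sum_{j=2}^{i}\sum_{k=1}^{j-1}\sum_{m=1}^{k}\tilde\rho_m^{k-1}\Delta\phi_k\,\tilde\rho_m^{j-1}\rho_j r_j.
\end{aligned}$$

Writing

$$y_k = \sum_{m=1}^{k}\left(\tilde\rho_m^{k-1}\right)^2,$$

we have that, for $k \le j$:

$$\sum_{m=1}^{k}\tilde\rho_m^{j-1}\tilde\rho_m^{k-1} = \tilde\rho_k^{j-1}y_k.$$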


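Therefore:

$$\sum_{j=1}^{i}\psi_{j\to i}z_{j\to i} = \sum_{j=1}^{i}\sum_{k=1}^{j}\tilde\rho_k^{j-1}y_k\,\Delta\phi_j\,\rho_k r_k + \sum_{j=2}^{i}\sum_{k=1}^{j-1}\tilde\rho_k^{j-1}y_k\,\Delta\phi_k\,\rho_j r_j.$$

With the following notations:

$$z_j = \sum_{k=1}^{j}\tilde\rho_k^{j-1}y_k\rho_k r_k = \gamma\lambda\rho_{j-1}z_{j-1} + \rho_j r_j y_j$$

$$\text{and}\quad \mathcal{D}_j = \sum_{k=1}^{j}\tilde\rho_k^{j-1}y_k\Delta\phi_k = \gamma\lambda\rho_{j-1}\mathcal{D}_{j-1} + y_j\Delta\phi_j,$$

and with the convention that $z_0 = 0$ and $\mathcal{D}_0 = 0$, one can write:

$$\sum_{j=1}^{i}\psi_{j\to i}z_{j\to i} = \sum_{j=1}^{i}\left(\Delta\phi_j\rho_j r_j y_j + \gamma\lambda\rho_{j-1}\left(\Delta\phi_j z_{j-1} + \rho_j r_j\mathcal{D}_{j-1}\right)\right).$$

Similarly, one can show that:

$$\sum_{j=1}^{i}\psi_{j\to i}\psi_{j\to i}^T = \sum_{j=1}^{i}\left(\Delta\phi_j\Delta\phi_j^T y_j + \gamma\lambda\rho_{j-1}\left(\Delta\phi_j\mathcal{D}_{j-1}^T + \mathcal{D}_{j-1}\Delta\phi_j^T\right)\right).$$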
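The recursive forms of the traces can be compared with their defining sums. The sketch below is an illustration on random toy data, not part of the derivation: `rho_tilde(m, k)` is a hypothetical helper standing for the discounted product of importance weights from index `m` to `k - 1`, a scalar stands in for the vector $\Delta\phi$, and the recursion used for `y` (namely $y_k = 1 + (\gamma\lambda\rho_{k-1})^2 y_{k-1}$) is the one that follows from the definition of $y_k$ as a sum of squared products.

```python
import math
import random

random.seed(0)
gamma, lam, n = 0.9, 0.7, 8
# 1-indexed toy samples; index 0 is padding, so rho[0] = 0 starts the traces at zero.
rho = [0.0] + [random.uniform(0.5, 1.5) for _ in range(n)]
r = [0.0] + [random.uniform(-1.0, 1.0) for _ in range(n)]
dphi = [0.0] + [random.uniform(-1.0, 1.0) for _ in range(n)]  # scalar stand-in

def rho_tilde(m, k):
    """(gamma*lam)^(k-m) * rho_m * ... * rho_{k-1}; equals 1 when m == k."""
    return (gamma * lam) ** (k - m) * math.prod(rho[m:k])

def y_direct(k):
    return sum(rho_tilde(m, k) ** 2 for m in range(1, k + 1))

def z_direct(j):
    return sum(rho_tilde(k, j) * y_direct(k) * rho[k] * r[k] for k in range(1, j + 1))

def d_direct(j):
    return sum(rho_tilde(k, j) * y_direct(k) * dphi[k] for k in range(1, j + 1))

y = z = d = 0.0
max_gap = 0.0
for j in range(1, n + 1):
    decay = gamma * lam * rho[j - 1]
    y = 1.0 + decay ** 2 * y           # y_j from y_{j-1}
    z = decay * z + rho[j] * r[j] * y  # z_j from z_{j-1}
    d = decay * d + y * dphi[j]        # D_j from D_{j-1}
    max_gap = max(max_gap,
                  abs(y - y_direct(j)), abs(z - z_direct(j)), abs(d - d_direct(j)))
```

The recursive loop costs a constant number of operations per sample, whereas the direct sums grow with the length of the trajectory.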

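Denoting

$$u_j = \sqrt{y_j}\,\Delta\phi_j \quad\text{and}\quad v_j = \frac{\gamma\lambda\rho_{j-1}}{\sqrt{y_j}}\,\mathcal{D}_{j-1},$$

and $I_2$ the $2\times 2$ identity matrix, we have:

$$\begin{aligned}
\tilde A_i = \sum_{j=1}^{i}\psi_{j\to i}\psi_{j\to i}^T &= \sum_{j=1}^{i}\left((u_j + v_j)(u_j + v_j)^T - v_jv_j^T\right)\\
&= \tilde A_{i-1} + \underbrace{\begin{pmatrix} u_i + v_i & v_i\end{pmatrix}}_{=U_i}\, I_2\, \underbrace{\begin{pmatrix}(u_i + v_i)^T\\ -v_i^T\end{pmatrix}}_{=V_i}.
\end{aligned}$$

We can apply the Woodbury identity given in Lemma 10:

$$C_i = (\tilde A_i)^{-1} = \left(\tilde A_{i-1} + U_iI_2V_i\right)^{-1} = C_{i-1} - C_{i-1}U_i\left(I_2 + V_iC_{i-1}U_i\right)^{-1}V_iC_{i-1}.$$

The other sum can also be reworked:

$$\tilde b_i = \tilde b_{i-1} + \Delta\phi_i\rho_i r_i y_i + \gamma\lambda\rho_{i-1}\left(\Delta\phi_i z_{i-1} + \rho_i r_i\mathcal{D}_{i-1}\right) = \tilde b_{i-1} + U_i\underbrace{\begin{pmatrix}\sqrt{y_i}\,\rho_i r_i + \frac{\gamma\lambda\rho_{i-1}}{\sqrt{y_i}}z_{i-1}\\[2pt] -\frac{\gamma\lambda\rho_{i-1}}{\sqrt{y_i}}z_{i-1}\end{pmatrix}}_{=W_i}.$$

Finally, the recursive BRM($\lambda$) estimate can be computed as follows:

$$\theta_i = C_i\tilde b_i = \theta_{i-1} + C_{i-1}U_i\left(I_2 + V_iC_{i-1}U_i\right)^{-1}\left(W_i - V_i\theta_{i-1}\right).$$

This gives BRM($\lambda$) as provided in Algorithm 4.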
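To make the recursion concrete, the sketch below compares a rank-2 Woodbury update with the batch least-squares solution on random toy data. It is an illustration under assumptions of ours, not the algorithm as printed in the paper: the features have dimension 2 so that a closed-form 2x2 inverse suffices, a ridge term `eps` is added so that the initial matrix `C = I / eps` is well defined (the batch estimate is regularized accordingly), and all names (`batch_theta`, `recursive_theta`, `rho_tilde`) are hypothetical.

```python
import math
import random

random.seed(1)
gamma, lam, eps, n = 0.95, 0.6, 1.0, 12  # eps: ridge term, an assumption of this sketch

# 1-indexed toy data; rho[0] = 0 so that the traces start from zero.
rho = [0.0] + [random.uniform(0.5, 1.5) for _ in range(n)]
r = [0.0] + [random.uniform(-1.0, 1.0) for _ in range(n)]
phi = [None] + [[random.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(n + 1)]
dphi = [None] + [[phi[k][c] - gamma * rho[k] * phi[k + 1][c] for c in range(2)]
                 for k in range(1, n + 1)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def inv2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def rho_tilde(j, k):
    """(gamma*lam)^(k-j) * rho_j * ... * rho_{k-1}, equal to 1 when j == k."""
    return (gamma * lam) ** (k - j) * math.prod(rho[j:k])

def batch_theta(i):
    """Regularized batch estimate (eps*I + sum psi psi^T)^{-1} sum psi z."""
    A = [[eps, 0.0], [0.0, eps]]
    b = [[0.0], [0.0]]
    for j in range(1, i + 1):
        psi = [sum(rho_tilde(j, k) * dphi[k][c] for k in range(j, i + 1)) for c in range(2)]
        z = sum(rho_tilde(j, k) * rho[k] * r[k] for k in range(j, i + 1))
        for p in range(2):
            b[p][0] += psi[p] * z
            for q in range(2):
                A[p][q] += psi[p] * psi[q]
    return [row[0] for row in matmul(inv2(A), b)]

def recursive_theta(i_max):
    """Same estimate through rank-2 Woodbury updates of C = A^{-1}."""
    C = [[1.0 / eps, 0.0], [0.0, 1.0 / eps]]
    theta = [[0.0], [0.0]]
    y, z, D = 0.0, 0.0, [0.0, 0.0]
    for i in range(1, i_max + 1):
        decay = gamma * lam * rho[i - 1]
        y = 1.0 + decay ** 2 * y                      # y_i from y_{i-1}
        s = math.sqrt(y)
        v = [decay / s * D[c] for c in range(2)]      # uses D_{i-1}
        uv = [s * dphi[i][c] + v[c] for c in range(2)]
        U = [[uv[0], v[0]], [uv[1], v[1]]]            # columns u+v and v
        V = [uv, [-v[0], -v[1]]]                      # rows (u+v)^T and -v^T
        W = [[s * rho[i] * r[i] + decay / s * z], [-decay / s * z]]  # uses z_{i-1}
        CU = matmul(C, U)
        VCU = matmul(V, CU)
        K = matmul(CU, inv2([[1.0 + VCU[0][0], VCU[0][1]],
                             [VCU[1][0], 1.0 + VCU[1][1]]]))
        Vtheta = matmul(V, theta)
        step = matmul(K, [[W[0][0] - Vtheta[0][0]], [W[1][0] - Vtheta[1][0]]])
        theta = [[theta[0][0] + step[0][0]], [theta[1][0] + step[1][0]]]
        KVC = matmul(K, matmul(V, C))
        C = [[C[p][q] - KVC[p][q] for q in range(2)] for p in range(2)]
        z = decay * z + rho[i] * r[i] * y             # z_i
        D = [decay * D[c] + y * dphi[i][c] for c in range(2)]  # D_i
    return [theta[0][0], theta[1][0]]

theta_batch = batch_theta(n)
theta_rec = recursive_theta(n)
```

The recursive version only ever inverts a 2x2 matrix per sample, whatever the number of features, which is the point of applying the Woodbury identity here.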

Appendix B. Proof of Theorem 3 (Convergence of BRM($\lambda$))

The proof of Theorem 3 follows the general idea of that of Proposition 4 of Bertsekas and
Yu (2009b). It is done in 2 steps. First we argue that the limit of the sequence is linked to
that of an alternative algorithm for which one cuts the traces at a certain depth $l$. Then, we
show that for all depths $l$, this alternative algorithm converges almost surely, we explicitly
compute its limit and make $l$ tend to infinity to obtain the limit of BRM($\lambda$).

We will only show that $\frac{1}{i}\tilde A_i$ tends to $\tilde A$. The argument is similar for $\frac{1}{i}\tilde b_i \to \tilde b$. Consider
the following $l$-truncated version of the algorithm based on the following alternative traces
(we here limit the "memory" of the traces to a size $l$):

$$y_{k,l} = \sum_{m=\max(1,k-l+1)}^{k}\left(\tilde\rho_m^{k-1}\right)^2$$

$$\mathcal{D}_{j,l} = \sum_{k=\max(1,j-l+1)}^{j}\tilde\rho_k^{j-1}y_{k,l}\Delta\phi_k$$

and update the following matrix:

$$\tilde A_{i,l} = \tilde A_{i-1,l} + \Delta\phi_i\Delta\phi_i^T y_{i,l} + \gamma\lambda\rho_{i-1}\left(\Delta\phi_i\mathcal{D}_{i-1,l}^T + \mathcal{D}_{i-1,l}\Delta\phi_i^T\right).$$

The assumption in Equation (15) implies that $\tilde\rho_i^{j-1} \le \beta^{j-i}$, therefore it can be seen that
for all $k$,

$$|y_{k,l} - y_k| = \sum_{m=1}^{\max(0,k-l)}\left(\tilde\rho_m^{k-1}\right)^2 \le \sum_{m=1}^{\max(0,k-l)}\beta^{2(k-m)} \le \frac{\beta^{2l}}{1-\beta^2} = \epsilon_1(l)$$

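where $\epsilon_1(l)$ tends to 0 when $l$ tends to infinity. Similarly, using the fact that $y_k \le \frac{1}{1-\beta^2}$ and
writing $K = \max_{s,s'}\|\phi(s) - \gamma\phi(s')\|_\infty$, one has for all $j$,

$$\begin{aligned}
\|\mathcal{D}_{j,l} - \mathcal{D}_j\|_\infty &\le \sum_{k=1}^{\max(0,j-l)}\tilde\rho_k^{j-1}\|y_k\Delta\phi_k\|_\infty + \sum_{k=\max(1,j-l+1)}^{j}\tilde\rho_k^{j-1}|y_{k,l} - y_k|\,\|\Delta\phi_k\|_\infty\\
&\le \sum_{k=1}^{\max(0,j-l)}\beta^{j-k}\frac{1}{1-\beta^2}K + \sum_{k=\max(1,j-l+1)}^{j}\beta^{j-k}\frac{\beta^{2l}}{1-\beta^2}K\\
&\le \frac{\beta^{l}}{1-\beta}\,\frac{1}{1-\beta^2}K + \frac{1}{1-\beta}\,\frac{\beta^{2l}}{1-\beta^2}K = \epsilon_2(l)
\end{aligned}$$

where $\epsilon_2(l)$ also tends to 0. Then, it can be seen that:

$$\begin{aligned}
\|\tilde A_{i,l} - \tilde A_i\|_\infty &= \left\|\tilde A_{i-1,l} - \tilde A_{i-1} + \Delta\phi_i\Delta\phi_i^T(y_{i,l} - y_i) + \gamma\lambda\rho_{i-1}\left(\Delta\phi_i(\mathcal{D}_{i-1,l} - \mathcal{D}_{i-1})^T + (\mathcal{D}_{i-1,l} - \mathcal{D}_{i-1})\Delta\phi_i^T\right)\right\|_\infty\\
&\le \|\tilde A_{i-1,l} - \tilde A_{i-1}\|_\infty + \|\Delta\phi_i\Delta\phi_i^T\|_\infty|y_{i,l} - y_i| + 2\beta\|\Delta\phi_i\|_\infty\|\mathcal{D}_{i-1,l} - \mathcal{D}_{i-1}\|_\infty\\
&\le \|\tilde A_{i-1,l} - \tilde A_{i-1}\|_\infty + K^2\epsilon_1(l) + 2\beta K\epsilon_2(l)
\end{aligned}$$

and, by a recurrence on $i$, one obtains

$$\left\|\frac{\tilde A_{i,l}}{i} - \frac{\tilde A_i}{i}\right\|_\infty \le \epsilon(l)$$

where $\epsilon(l)$ tends to 0 when $l$ tends to infinity. This implies that:

$$\liminf_{i\to\infty}\frac{\tilde A_{i,l}}{i} - \epsilon(l) \le \liminf_{i\to\infty}\frac{\tilde A_i}{i} \le \limsup_{i\to\infty}\frac{\tilde A_i}{i} \le \limsup_{i\to\infty}\frac{\tilde A_{i,l}}{i} + \epsilon(l).$$

In other words, one can see that $\lim_{i\to\infty}\frac{\tilde A_i}{i}$ and $\lim_{l\to\infty}\lim_{i\to\infty}\frac{\tilde A_{i,l}}{i}$ are equal if the latter
exists. In the remainder of the proof, we show that the latter limit indeed exists and we
compute it explicitly.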
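The first truncation bound, $|y_{k,l} - y_k| \le \beta^{2l}/(1-\beta^2)$, can be observed numerically. The sketch below relies on a simplifying assumption of ours: the importance weights are drawn in $[0, \rho_{\max}]$ and $\beta = \gamma\lambda\rho_{\max} < 1$, which is one sufficient way to guarantee that the discounted products decay geometrically. The helper names are hypothetical.

```python
import math
import random

random.seed(2)
gamma, lam, rho_max, n = 0.9, 0.8, 1.2, 60
beta = gamma * lam * rho_max  # 0.864 < 1, assumed geometric decay rate
rho = [0.0] + [random.uniform(0.0, rho_max) for _ in range(n)]  # 1-indexed weights

def rho_tilde(m, k):
    """(gamma*lam)^(k-m) * rho_m * ... * rho_{k-1}; equals 1 when m == k."""
    return (gamma * lam) ** (k - m) * math.prod(rho[m:k])

def y_trunc(k, l):
    """Trace y_k restricted to the last l terms."""
    return sum(rho_tilde(m, k) ** 2 for m in range(max(1, k - l + 1), k + 1))

def y_full(k):
    return y_trunc(k, k)

def worst_gap(l):
    """Largest truncation error over the whole trajectory."""
    return max(abs(y_trunc(k, l) - y_full(k)) for k in range(1, n + 1))

def eps1(l):
    return beta ** (2 * l) / (1.0 - beta ** 2)

bound_holds = all(worst_gap(l) <= eps1(l) + 1e-12 for l in (1, 5, 10, 20))
```

Since `eps1(l)` decays geometrically in `l`, a moderate memory size already makes the truncated and untruncated traces numerically indistinguishable in this toy setting.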

Let us fix some $l$ and let us consider the sequence $\left(\frac{\tilde A_{i,l}}{i}\right)$. At some index $i$, $y_{i,l}$ depends
only on the last $l$ samples, while $\mathcal{D}_{i,l}$ depends on the same samples and the last $l$ values of
$y_{j,l}$, thus on the last $2l$ samples. It is then natural to view the computation of $\tilde A_{i,l}$, which is
based on $y_{i,l}$, $\mathcal{D}_{i-1,l}$ and $\Delta\phi_i = \phi_i - \gamma\rho_i\phi_{i+1}$, as being related to a Markov chain of which
the states are the $2l+1$ consecutive states of the original chain $(s_{i-2l}, \dots, s_i, s_{i+1})$. Write $E_0$
the expectation with respect to its stationary distribution. By the Markov chain Ergodic
Theorem, we have with probability 1:

$$\lim_{i\to\infty}\frac{\tilde A_{i,l}}{i} = E_0\left[\Delta\phi_{2l}\Delta\phi_{2l}^T y_{2l,l} + \lambda\gamma\rho_{2l-1}\left(\Delta\phi_{2l}\mathcal{D}_{2l-1,l}^T + \mathcal{D}_{2l-1,l}\Delta\phi_{2l}^T\right)\right]. \tag{23}$$



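Let us now explicitly compute this expectation. Write $x_i$ the indicator vector (of which
the $k$-th coordinate equals 1 when the state at time $i$ is $k$ and 0 otherwise). One has the
following relation: $\phi_i = \Phi^T x_i$. Let us first look at the left part of the above limit:

$$\begin{aligned}
E_0\left[\Delta\phi_{2l}\Delta\phi_{2l}^T y_{2l,l}\right] &= E_0\left[(\phi_{2l} - \gamma\rho_{2l}\phi_{2l+1})(\phi_{2l} - \gamma\rho_{2l}\phi_{2l+1})^T y_{2l,l}\right]\\
&= E_0\left[\Phi^T(x_{2l} - \gamma\rho_{2l}x_{2l+1})(x_{2l} - \gamma\rho_{2l}x_{2l+1})^T\Phi\sum_{m=l+1}^{2l}(\lambda\gamma)^{2(2l-m)}\left(\rho_m^{2l-1}\right)^2\right]\\
&= \Phi^T\left(\sum_{m=l+1}^{2l}(\lambda\gamma)^{2(2l-m)}E_0\left[\left(\rho_m^{2l-1}\right)^2(x_{2l} - \gamma\rho_{2l}x_{2l+1})(x_{2l} - \gamma\rho_{2l}x_{2l+1})^T\right]\right)\Phi\\
&= \Phi^T\left(\sum_{m=l+1}^{2l}(\lambda\gamma)^{2(2l-m)}E_0\left[X_{m,2l,2l} - \gamma X_{m,2l,2l+1} - \gamma X_{m,2l+1,2l} + \gamma^2X_{m,2l+1,2l+1}\right]\right)\Phi
\end{aligned}$$

where we used the definition $\tilde\rho_m^{k-1} = (\lambda\gamma)^{k-m}\rho_m^{k-1}$ and the notation $X_{m,i,j} = \rho_m^{i-1}\rho_m^{j-1}x_ix_j^T$.
To finish the computation, we will mainly rely on the following Lemma:

Lemma 12 (Some identities) Let $\tilde P$ be the matrix of which the coordinates are $\tilde p_{ss'} =
\sum_a\pi(s,a)\rho(s,a)T(s,a,s')$, which is in general not a stochastic matrix. Let $\mu_0$ be the
stationary distribution of the behavior policy $\pi_0$. Write $\tilde D_i = \operatorname{diag}\left((\tilde P^T)^i\mu_0\right)$. Then

$$\forall m \le i,\quad E_0[X_{m,i,i}] = \tilde D_{i-m}$$

$$\forall m \le i \le j,\quad E_0[X_{m,i,j}] = \tilde D_{i-m}P^{j-i}$$

$$\forall m \le j \le i,\quad E_0[X_{m,i,j}] = (P^T)^{i-j}\tilde D_{j-m}.$$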
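Proof We first observe that:

$$E_0[X_{m,i,i}] = E_0\left[\left(\rho_m^{i-1}\right)^2x_ix_i^T\right] = \operatorname{diag}\left(E_0\left[\left(\rho_m^{i-1}\right)^2x_i\right]\right).$$

To prove the identity, we will thus simply provide a proof by recurrence that $E_0\left[\left(\rho_m^{i-1}\right)^2x_i\right] =
(\tilde P^T)^{i-m}\mu_0$. For $i = m$, we have $E_0[x_m] = \mu_0$. Now suppose the relation holds for $i$ and let
us prove it for $i + 1$.

$$\begin{aligned}
E_0\left[\left(\rho_m^{i}\right)^2x_{i+1}\right] &= E_0\left[E_0\left[\left(\rho_m^{i}\right)^2x_{i+1}\mid\mathcal{F}_i\right]\right]\\
&= E_0\left[\left(\rho_m^{i-1}\right)^2E_0\left[(\rho_i)^2x_{i+1}\mid\mathcal{F}_i\right]\right].
\end{aligned}$$

Write $\mathcal{F}_i$ the realization of the process until time $i$. Recalling that $s_i$ is the state at time $i$
and $x_i$ is the indicator vector corresponding to $s_i$, one has for all $s'$:

$$\begin{aligned}
E_0\left[(\rho_i)^2x_{i+1}(s')\mid\mathcal{F}_i\right] &= \sum_a\pi_0(s_i,a)\rho(s_i,a)^2T(s_i,a,s')\\
&= \sum_a\pi(s_i,a)\rho(s_i,a)T(s_i,a,s')\\
&= \tilde p_{s_i,s'}\\
&= [\tilde P^Tx_i](s').
\end{aligned}$$

As this is true for all $s'$, we deduce that $E_0\left[(\rho_i)^2x_{i+1}\mid\mathcal{F}_i\right] = \tilde P^Tx_i$ and

$$\begin{aligned}
E_0\left[\left(\rho_m^{i}\right)^2x_{i+1}\right] &= E_0\left[\left(\rho_m^{i-1}\right)^2\tilde P^Tx_i\right]\\
&= \tilde P^TE_0\left[\left(\rho_m^{i-1}\right)^2x_i\right]\\
&= \tilde P^T(\tilde P^T)^{i-m}\mu_0\\
&= (\tilde P^T)^{i+1-m}\mu_0,
\end{aligned}$$

which concludes the proof by recurrence.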
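The one-step identity at the heart of this recurrence, namely that weighting by the squared importance ratio under the behavior policy amounts to weighting once by the ratio under the target policy, can be checked exactly on a toy MDP. Everything below is an illustrative assumption: the two policies and the transition kernel are arbitrary rational numbers, and `P_tilde` is our name for the matrix of Lemma 12.

```python
from fractions import Fraction as F

S, A = 2, 2
pi0 = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]     # behavior policy pi_0(s, a)
pi = [[F(1, 3), F(2, 3)], [F(1, 2), F(1, 2)]]      # target policy pi(s, a)
T = [[[F(1, 5), F(4, 5)], [F(1, 1), F(0, 1)]],
     [[F(1, 2), F(1, 2)], [F(3, 10), F(7, 10)]]]   # T[s][a][s_next]

# Importance ratios rho(s, a) = pi(s, a) / pi_0(s, a).
rho = [[pi[s][a] / pi0[s][a] for a in range(A)] for s in range(S)]

# Matrix with entries sum_a pi(s, a) rho(s, a) T(s, a, s').
P_tilde = [[sum(pi[s][a] * rho[s][a] * T[s][a][t] for a in range(A))
            for t in range(S)] for s in range(S)]

# Conditional expectation of rho^2 * 1{next state = t}, actions drawn from pi_0.
cond = [[sum(pi0[s][a] * rho[s][a] ** 2 * T[s][a][t] for a in range(A))
         for t in range(S)] for s in range(S)]

row_sums = [sum(row) for row in P_tilde]
```

The row sums of `P_tilde` differ from 1 as soon as the two policies differ, which illustrates the remark in Lemma 12 that this matrix is in general not stochastic.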

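Let us consider the next identity. For $i \le j$,

$$\begin{aligned}
E_0\left[\rho_m^{i-1}\rho_m^{j-1}x_ix_j^T\right] &= E_0\left[E_0\left[\rho_m^{i-1}\rho_m^{j-1}x_ix_j^T\mid\mathcal{F}_i\right]\right]\\
&= E_0\left[\left(\rho_m^{i-1}\right)^2x_i\,E_0\left[\rho_i^{j-1}x_j^T\mid\mathcal{F}_i\right]\right]\\
&= E_0\left[\left(\rho_m^{i-1}\right)^2x_ix_i^TP^{j-i}\right]\\
&= \operatorname{diag}\left((\tilde P^T)^{i-m}\mu_0\right)P^{j-i}\\
&= \tilde D_{i-m}P^{j-i}.
\end{aligned}$$

Eventually, the last identity is obtained by considering $X_{m,i,j}^T = X_{m,j,i}$.

Thus, coming back to our calculus,

$$\begin{aligned}
E_0\left[\Delta\phi_{2l}\Delta\phi_{2l}^T y_{2l,l}\right] &= \Phi^T\left(\sum_{m=l+1}^{2l}(\lambda\gamma)^{2(2l-m)}\left(\tilde D_{2l-m} - \gamma\tilde D_{2l-m}P - \gamma P^T\tilde D_{2l-m} + \gamma^2\tilde D_{2l+1-m}\right)\right)\Phi\\
&= \Phi^T\left(D_l - \gamma D_lP - \gamma P^TD_l + \gamma^2D'_l\right)\Phi
\end{aligned} \tag{24}$$

with $D_l = \sum_{j=0}^{l-1}(\lambda\gamma)^{2j}\tilde D_j$ and $D'_l = \sum_{j=0}^{l-1}(\lambda\gamma)^{2j}\tilde D_{j+1}$.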



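Similarly, the second term on the right side of Equation (23) satisfies:

$$\begin{aligned}
E_0\left[\rho_{2l-1}\mathcal{D}_{2l-1,l}\Delta\phi_{2l}^T\right] &= E_0\left[\rho_{2l-1}\sum_{k=l}^{2l-1}\tilde\rho_k^{2l-2}y_{k,l}\Delta\phi_k\Delta\phi_{2l}^T\right]\\
&= E_0\left[\sum_{k=l}^{2l-1}(\lambda\gamma)^{2l-1-k}\rho_k^{2l-1}\sum_{m=k-l+1}^{k}\left(\tilde\rho_m^{k-1}\right)^2\Phi^T(x_k - \gamma\rho_kx_{k+1})(x_{2l} - \gamma\rho_{2l}x_{2l+1})^T\Phi\right]\\
&= \Phi^T\left(\sum_{k=l}^{2l-1}(\lambda\gamma)^{2l-1-k}\sum_{m=k-l+1}^{k}(\lambda\gamma)^{2(k-m)}E_0\left[\rho_k^{2l-1}\left(\rho_m^{k-1}\right)^2(x_k - \gamma\rho_kx_{k+1})(x_{2l} - \gamma\rho_{2l}x_{2l+1})^T\right]\right)\Phi\\
&= \Phi^T\left(\sum_{k=l}^{2l-1}(\lambda\gamma)^{2l-1-k}\sum_{m=k-l+1}^{k}(\lambda\gamma)^{2(k-m)}E_0\left[X_{m,k,2l} - \gamma X_{m,k+1,2l} - \gamma X_{m,k,2l+1} + \gamma^2X_{m,k+1,2l+1}\right]\right)\Phi\\
&= \Phi^T\left(\sum_{k=l}^{2l-1}(\lambda\gamma)^{2l-1-k}\sum_{m=k-l+1}^{k}(\lambda\gamma)^{2(k-m)}\left(\tilde D_{k-m}P^{2l-k} - \gamma\tilde D_{k+1-m}P^{2l-k-1} - \gamma\tilde D_{k-m}P^{2l+1-k} + \gamma^2\tilde D_{k+1-m}P^{2l-k}\right)\right)\Phi\\
&= \Phi^T\left(\sum_{k=l}^{2l-1}(\lambda\gamma)^{2l-1-k}\sum_{m=k-l+1}^{k}(\lambda\gamma)^{2(k-m)}\left(\tilde D_{k-m}P^{2l-k}(I - \gamma P) - \gamma\tilde D_{k+1-m}P^{2l-1-k}(I - \gamma P)\right)\right)\Phi\\
&= \Phi^T\left(\sum_{k=l}^{2l-1}(\lambda\gamma)^{2l-1-k}\sum_{m=k-l+1}^{k}(\lambda\gamma)^{2(k-m)}\left(\tilde D_{k-m}P - \gamma\tilde D_{k+1-m}\right)P^{2l-1-k}(I - \gamma P)\right)\Phi\\
&= \Phi^T\left(\sum_{k=l}^{2l-1}(\lambda\gamma)^{2l-1-k}\left(D_lP - \gamma D'_l\right)P^{2l-1-k}(I - \gamma P)\right)\Phi\\
&= \Phi^T\left(D_lP - \gamma D'_l\right)Q_l(I - \gamma P)\Phi
\end{aligned}$$

with $Q_l = \sum_{j=0}^{l-1}(\lambda\gamma P)^j$.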

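Gathering this and Equation (24), we see that the limit of $\frac{\tilde A_{i,l}}{i}$ expressed in Equation (23)
equals:

$$\Phi^T\left[D_l - \gamma D_lP - \gamma P^TD_l + \gamma^2D'_l + \lambda\gamma\left((D_lP - \gamma D'_l)Q_l(I - \gamma P) + (I - \gamma P^T)Q_l^T(P^TD_l - \gamma D'_l)\right)\right]\Phi.$$

When $l$ tends to infinity, $Q_l$ tends to $Q = (I - \lambda\gamma P)^{-1}$. The assumption of Equation (15)
ensures that $(\lambda\gamma)^2\tilde P$ has spectral radius smaller than 1, and thus when $l$ tends to infinity,
$D_l$ tends to $D = \operatorname{diag}\left((I - (\lambda\gamma)^2\tilde P^T)^{-1}\mu_0\right)$ and $D'_l$ to $D' = \operatorname{diag}\left(\tilde P^T(I - (\lambda\gamma)^2\tilde P^T)^{-1}\mu_0\right)$.
In other words, $\lim_{l\to\infty}\lim_{i\to\infty}\frac{\tilde A_{i,l}}{i}$ exists with probability 1 and equals:

$$\Phi^T\left[D - \gamma DP - \gamma P^TD + \gamma^2D' + \lambda\gamma\left((DP - \gamma D')Q(I - \gamma P) + (I - \gamma P^T)Q^T(P^TD - \gamma D')\right)\right]\Phi.$$

Eventually, this shows that $\lim_{i\to\infty}\frac{\tilde A_i}{i}$ exists with probability 1 and shares the same value.
A similar reasoning allows to show that $\lim_{i\to\infty}\frac{\tilde b_i}{i}$ exists and equals

$$\Phi^T\left[(I - \gamma P^T)Q^TD + \lambda\gamma(DP - \gamma D')Q\right]R^\pi.$$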
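The limit step relies on a Neumann series: the truncated sums $\sum_{j<l}(\lambda\gamma P)^j$ converge to $(I - \lambda\gamma P)^{-1}$ when the spectral radius of $\lambda\gamma P$ is below 1. The sketch below illustrates this on a 2x2 stochastic matrix with made-up entries; `partial_sum`, `gap` and the numerical values are assumptions chosen for the illustration only.

```python
gamma, lam = 0.9, 0.8
P = [[0.7, 0.3], [0.4, 0.6]]  # toy stochastic matrix (illustrative values)
M = [[gamma * lam * p for p in row] for row in P]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def inv2(X):
    (a, b), (c, d) = X
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def partial_sum(l):
    """Truncated series sum_{j=0}^{l-1} M^j."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(l):
        total = [[t + x for t, x in zip(rt, rx)] for rt, rx in zip(total, term)]
        term = matmul(term, M)
    return total

# Closed-form limit (I - M)^{-1}.
Q = inv2([[1.0 - M[0][0], -M[0][1]], [-M[1][0], 1.0 - M[1][1]]])

def gap(l):
    """Largest entrywise distance between the truncated sum and its limit."""
    Ql = partial_sum(l)
    return max(abs(Ql[r][c] - Q[r][c]) for r in range(2) for c in range(2))
```

Here the gap shrinks like $(\lambda\gamma)^l$ since $P$ is stochastic; the same mechanism, with $(\lambda\gamma)^2$ and the non-stochastic matrix of Lemma 12, drives the convergence of the diagonal matrices.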



Appendix C. Proof of Proposition 9

To prove Proposition 9, we need the following technical lemma.

Lemma 13 Forget the notations used so far. Let $\alpha_i$ and $\beta_i$ be two forward recursions
defined as

$$\alpha_i = a_i + \eta_i\alpha_{i+1} \quad\text{and}\quad \beta_i = b_i + \eta_i\beta_{i+1}.$$

Assume that for any function $f$ we have that$^{15}$

$$E[f(a_i, b_i, \eta_i)] = E[f(a_{i-1}, b_{i-1}, \eta_{i-1})].$$

Let also $u_i$, $v_i$ and $w_i$ be the backward recursions defined as:

$$w_i = 1 + \eta_{i-1}^2w_{i-1}$$

$$u_i = a_iw_i + \eta_{i-1}u_{i-1}$$

$$v_i = b_iw_i + \eta_{i-1}v_{i-1}$$

Then, we have:

$$E[\alpha_i\beta_i] = E[a_iv_i + b_iu_i - a_ib_iw_i].$$

Proof The proof looks like the one of Proposition 6, but is a little bit more complicated.
A key equality, to be applied repeatedly, is:

$$\begin{aligned}
\alpha_i\beta_i &= (a_i + \eta_i\alpha_{i+1})(b_i + \eta_i\beta_{i+1})\\
&= a_i\beta_i + b_i\alpha_i + \eta_i^2\alpha_{i+1}\beta_{i+1} - a_ib_i.
\end{aligned}$$

Another equality to be used repeatedly makes use of the "stationarity" assumption. For
any $k \ge 0$ we have:

$$E\left[\left(\prod_{j=0}^{k}\eta_{i-j}^2\right)\alpha_{i+1}\beta_{i+1}\right] = E\left[\left(\prod_{j=1}^{k+1}\eta_{i-j}^2\right)\alpha_i\beta_i\right].$$

15. This is typically true if the index $i$ refers to a state sampled according to some stationary distribution,
which is the case we are interested in.

These two identities can be used to work the term of interest:

$$\begin{aligned}
E[\alpha_i\beta_i] &= E[(a_i + \eta_i\alpha_{i+1})(b_i + \eta_i\beta_{i+1})]\\
&= E[a_i\beta_i] + E[b_i\alpha_i] + E[\eta_i^2\alpha_{i+1}\beta_{i+1}] - E[a_ib_i]\\
&= E[a_i\beta_i] + E[b_i\alpha_i] - E[a_ib_i] + E[\eta_{i-1}^2(a_i + \eta_i\alpha_{i+1})(b_i + \eta_i\beta_{i+1})]\\
&= E[a_i(1 + \eta_{i-1}^2)\beta_i] + E[b_i(1 + \eta_{i-1}^2)\alpha_i] - E[a_ib_i(1 + \eta_{i-1}^2)] + E[(\eta_{i-1}\eta_i)^2\alpha_{i+1}\beta_{i+1}].
\end{aligned}$$

This process can be repeated, giving

$$E[\alpha_i\beta_i] = E\left[(a_i\beta_i + b_i\alpha_i - a_ib_i)\left(1 + \eta_{i-1}^2 + (\eta_{i-1}\eta_{i-2})^2 + \cdots\right)\right].$$

We have that

$$w_i = 1 + \eta_{i-1}^2w_{i-1} = 1 + \eta_{i-1}^2 + (\eta_{i-1}\eta_{i-2})^2 + \cdots,$$

therefore:

$$E[\alpha_i\beta_i] = E[a_iw_i\beta_i] + E[b_iw_i\alpha_i] - E[a_ib_iw_i].$$

We can work on the first term:

$$\begin{aligned}
E[a_iw_i\beta_i] &= E[a_iw_i(b_i + \eta_i\beta_{i+1})]\\
&= E[a_iw_ib_i] + E[a_{i-1}w_{i-1}\eta_{i-1}(b_i + \eta_i\beta_{i+1})]\\
&= E[b_i(a_iw_i + \eta_{i-1}(a_{i-1}w_{i-1}) + \eta_{i-1}\eta_{i-2}(a_{i-2}w_{i-2}) + \cdots)]\\
&= E[b_iu_i].
\end{aligned}$$

The work on the second term is symmetric:

$$E[b_iw_i\alpha_i] = E[a_iv_i].$$

This finishes proving the result. ■
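Lemma 13 can be checked exactly in a simple stationary setting. The sketch below is an illustration built on an assumption of ours, not the setting of the paper: the sequences $(a_i, b_i, \eta_i)$ are periodic and the index is read modulo the period, so that an average over one period plays the role of the stationary expectation. The forward and backward recursions are then solved by fixed-point iteration around the cycle.

```python
import random
from statistics import mean

random.seed(3)
n, passes = 5, 400  # period of the cycle and number of fixed-point sweeps
a = [random.uniform(-1.0, 1.0) for _ in range(n)]
b = [random.uniform(-1.0, 1.0) for _ in range(n)]
eta = [random.uniform(0.0, 0.9) for _ in range(n)]

# Forward recursions alpha_i = a_i + eta_i * alpha_{i+1} (indices modulo n).
alpha, beta = [0.0] * n, [0.0] * n
for _ in range(passes):
    for i in reversed(range(n)):
        alpha[i] = a[i] + eta[i] * alpha[(i + 1) % n]
        beta[i] = b[i] + eta[i] * beta[(i + 1) % n]

# Backward recursions; index i - 1 wraps around through Python's negative indexing.
w, u, v = [0.0] * n, [0.0] * n, [0.0] * n
for _ in range(passes):
    for i in range(n):
        w[i] = 1.0 + eta[i - 1] ** 2 * w[i - 1]
        u[i] = a[i] * w[i] + eta[i - 1] * u[i - 1]
        v[i] = b[i] * w[i] + eta[i - 1] * v[i - 1]

# Averages over one period stand for the stationary expectations.
lhs = mean(alpha[i] * beta[i] for i in range(n))
rhs = mean(a[i] * v[i] + b[i] * u[i] - a[i] * b[i] * w[i] for i in range(n))
```

The left-hand side needs future quantities (forward view), while the right-hand side only involves traces accumulated from the past (backward view), which is what makes the identity usable online.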

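The proof of Proposition 9 is a simple application of the preceding technical lemma. By
Lemma 5 we have that

$$\underbrace{\delta_i^\lambda(\theta)}_{=\alpha_i} = \underbrace{\delta_i(\theta)}_{=a_i} + \underbrace{\gamma\lambda\rho_i}_{=\eta_i}\,\underbrace{\delta_{i+1}^\lambda(\theta)}_{=\alpha_{i+1}}.$$

By Lemma 7 we have that

$$\underbrace{g_i^\lambda}_{=\beta_i} = \underbrace{\gamma\rho_i(1-\lambda)\phi_{i+1}}_{=b_i} + \underbrace{\gamma\lambda\rho_i}_{=\eta_i}\,\underbrace{g_{i+1}^\lambda}_{=\beta_{i+1}}.$$

The result is then a direct application of Lemma 13.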